Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Ying Lu (Renmin University of China)
Gao Cong (Nanyang Technological University, Singapore)
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Motivation
• If we add a new shop at location Q, which existing shops will be influenced?
• Influence factors:
  – Spatial distance → results: D, F
  – Textual similarity (services/products offered) → results: F, C
[Figure: map of shops labeled with their categories (clothes, food, sports) around the query location Q.]
Problems of Finding Influential Sets
• Traditional query: reverse k nearest neighbor query (RkNN)
• Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)
Problem Statement
Spatial-Textual Similarity
• Describes the similarity between objects based on both spatial proximity and textual similarity.
Spatial-Textual Similarity Function (extended Jaccard on text weight vectors):

  EJ(a, b) = (a · b) / (‖a‖² + ‖b‖² − a · b)
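As a minimal sketch of the similarity function, the extended Jaccard above translates directly into code. The combined score `sim_st` below assumes the paper's usual form α·(spatial proximity) + (1−α)·EJ, with the spatial distance normalized by a hypothetical maximum distance `d_max`; the exact normalization is an assumption, not taken from the slides.

```python
import math

def extended_jaccard(a, b):
    """EJ(a, b) = (a . b) / (||a||^2 + ||b||^2 - a . b) on weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def sim_st(p1, p2, v1, v2, alpha, d_max):
    """Spatial-textual similarity: alpha * spatial proximity
    + (1 - alpha) * extended Jaccard (normalization by d_max assumed)."""
    spatial = 1.0 - math.dist(p1, p2) / d_max
    return alpha * spatial + (1 - alpha) * extended_jaccard(v1, v2)
```

Identical text vectors give EJ = 1 and disjoint ones give EJ = 0, so the score stays in [0, 1] like the spatial part.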
Problem Statement (cont'd)
• RSTkNN query
  – Finds the objects that have the query object as one of their k most spatial-textually similar objects.
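The definition can be made concrete with a brute-force sketch: an object o is a result iff q scores at least as high as o's k-th most similar database object. This O(n²) loop is for illustration only (the `sim` callback and tie handling via `>=` are assumptions), not the paper's algorithm.

```python
def rst_knn(query, objects, k, sim):
    """Brute-force RSTkNN: return the objects having `query` among
    their k most spatial-textually similar objects."""
    result = []
    for i, o in enumerate(objects):
        others = objects[:i] + objects[i + 1:]
        # similarity of o to its k-th most similar other object
        kth = sorted((sim(o, p) for p in others), reverse=True)[k - 1]
        if sim(o, query) >= kth:
            result.append(o)
    return result
```

For example, with 1-D points and sim(a, b) = −|a − b|, `rst_knn(0.5, [0.0, 1.0, 5.0], 1, ...)` keeps 0.0 and 1.0 but not 5.0, whose nearest neighbor stays 1.0.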
Related Work
• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/plane pruning strategies (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)
• Branch and bound, based on Lp-norm metric spaces (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging features of RSTkNN:
• Euclidean geometric properties are lost.
• The text space is high-dimensional.
• k and α differ from query to query.
Baseline Method
• For each object o in the database, precompute its spatial NNs and its textual NNs.
• Given a query q with parameters k and α, run the Threshold Algorithm over the two precomputed lists to obtain each object o's spatial-textual kNN o'; o is a result iff q is more similar to o than o' is.
• Inefficient, since it lacks a suitable index structure.
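The merge step of the baseline can be sketched with Fagin's Threshold Algorithm over two ranked lists. This is a generic TA sketch under assumed inputs (two descending `(id, score)` lists with random access via dicts), not the paper's exact implementation.

```python
def threshold_algorithm(spatial_list, textual_list, k, alpha):
    """Fagin's Threshold Algorithm over two lists of (id, score) pairs,
    each sorted by descending score. Returns the top-k ids by the
    combined score alpha * spatial + (1 - alpha) * textual."""
    s_score, t_score = dict(spatial_list), dict(textual_list)
    combined = lambda i: alpha * s_score[i] + (1 - alpha) * t_score[i]
    seen, top = set(), []
    for (si, ss), (ti, ts) in zip(spatial_list, textual_list):
        for i in (si, ti):                    # sorted access, then random access
            if i not in seen:
                seen.add(i)
                top.append((combined(i), i))
        top = sorted(top, reverse=True)[:k]
        threshold = alpha * ss + (1 - alpha) * ts
        if len(top) == k and top[-1][0] >= threshold:
            break                             # no unseen id can beat the top-k
    return [i for _, i in top]
```

The threshold is the best combined score any unseen object could still reach at the current scan depth, which is what lets the scan stop early.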
Intersection and Union R-tree (IUR-tree)
[Figure: an IUR-tree over objects p1–p5 and query q(0.5, 2.5). Each leaf stores an object's text vector (ObjVct1–ObjVct5); each internal node N1–N4 stores its MBR together with an intersection vector (IntVct, per-term minimum weight) and a union vector (UniVct, per-term maximum weight) over its children's text vectors.]
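A minimal sketch of an IUR-tree node, assuming the intersection/union vectors are per-term minima/maxima of the children's weights (names `IURNode`, `Obj`, and the 4-tuple MBR layout are illustrative, not from the paper):

```python
class Obj:
    """Leaf object: a point MBR plus its text weight vector."""
    def __init__(self, mbr, vct):
        self.mbr = mbr                       # (xmin, ymin, xmax, ymax)
        self.union_vct = self.int_vct = vct  # a leaf's int/union vectors coincide

class IURNode:
    """R-tree node augmented with intersection and union text vectors,
    computed bottom-up from its children."""
    def __init__(self, children):
        self.children = children
        mbrs = [c.mbr for c in children]
        self.mbr = (min(m[0] for m in mbrs), min(m[1] for m in mbrs),
                    max(m[2] for m in mbrs), max(m[3] for m in mbrs))
        terms = set().union(*(c.union_vct for c in children))
        # union vector: max weight per term; intersection: min (0 if absent)
        self.union_vct = {t: max(c.union_vct.get(t, 0) for c in children)
                          for t in terms}
        self.int_vct = {t: min(c.int_vct.get(t, 0) for c in children)
                        for t in terms}
```

These two vectors are what make the MinST/MaxST similarity bounds computable at internal nodes without touching the objects below.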
Main Idea of the Search Strategy
For each entry E in the IUR-tree, maintain a lower bound kNNL(E) and an upper bound kNNU(E) on the similarity between E's objects and their k-th nearest neighbors:
• Prune E when query q is no more similar than kNNL(E).
• Report E as results when query q is more similar than kNNU(E).
[Figure: entry E with bounds L and U, and query objects q1–q3 illustrating the prune, candidate, and report cases.]
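The prune/report rule can be sketched as a three-way decision, given bounds MinST/MaxST on SimST(q, o) for all objects o in E and the entry's kNNL/kNNU bounds. Whether the paper's comparisons are strict or non-strict is an assumption here.

```python
def classify_entry(min_st, max_st, knn_l, knn_u):
    """Three-way search-strategy decision for an entry E:
    - prune if q cannot be more similar than E's k-th NN lower bound;
    - report if q is certainly more similar than E's k-th NN upper bound;
    - otherwise keep E as a candidate for further expansion."""
    if max_st < knn_l:
        return 'prune'
    if min_st > knn_u:
        return 'result'
    return 'candidate'
```

Only the 'candidate' case forces descending into E, which is where the pruning power comes from.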
How to Compute the Bounds
Similarity approximations between entries E and E':
• MinST(E, E'): ∀o' ∈ E', ∀o ∈ E: SimST(o, o') ≥ MinST(E, E')
• TightMinST(E, E'): ∃o' ∈ E', ∀o ∈ E: SimST(o, o') ≥ TightMinST(E, E')
• MaxST(E, E'): ∀o' ∈ E', ∀o ∈ E: SimST(o, o') ≤ MaxST(E, E')
Example for Computing Bounds
Entries traveled so far: N1, N2, N3. Given k = 2, compute kNNL(N1) and kNNU(N1).
• Compute kNNL(N1): TightMinST(N1, N2) = 0.179, TightMinST(N1, N3) = 0.564, MinST(N1, N2) = 0.095, MinST(N1, N3) = 0.370 (values decrease) ⇒ kNNL(N1) = 0.370
• Compute kNNU(N1): MaxST(N1, N3) = 0.432, MaxST(N1, N2) = 0.150 (values decrease) ⇒ kNNU(N1) = 0.432
[Figure: entries N1–N4 and query q(0.5, 2.5) in the plane.]
Overview of the Search Algorithm
• RSTkNN algorithm:
  – Traverse the IUR-tree from the root.
  – Progressively update the lower and upper bounds.
  – Apply the search strategy:
    • prune unrelated entries into Pruned;
    • report entries as results into Ans;
    • add candidate objects into Cnd.
  – Final verification: for each object in Cnd, decide whether it is a result by updating its bounds while expanding entries in Pruned.
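The steps above can be sketched as a best-first traversal. The callbacks `classify`, `expand`, and `priority` are hypothetical stand-ins for the bound computations and queue ordering (the real algorithm orders entries by their bounds; a constant priority is used here as a placeholder), and final verification is only indicated by a comment.

```python
from heapq import heappush, heappop

def rst_knn_search(root, classify, expand, priority=lambda e: 0.0):
    """Best-first RSTkNN traversal sketch. `classify(e)` returns
    'prune' / 'result' / 'candidate'; `expand(e)` yields child entries,
    or None when e is a leaf object."""
    answers, candidates, pruned = [], [], []
    U = [(-priority(root), id(root), root)]   # max-priority queue
    while U:
        _, _, e = heappop(U)
        label = classify(e)
        if label == 'prune':
            pruned.append(e)
        elif label == 'result':
            answers.append(e)
        else:
            children = expand(e)
            if children is None:              # leaf object: defer to verification
                candidates.append(e)
            else:
                for c in children:
                    heappush(U, (-priority(c), id(c), c))
    # final verification would refine candidate bounds using `pruned`
    return answers, candidates
```

With a toy tree whose root expands into three leaves classified 'result', 'prune', and 'candidate', the traversal reports the first, discards the second, and defers the third.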
Example: Execution of the RSTkNN Algorithm on the IUR-tree (k = 2, α = 0.6)
Step 1: Initialize N4.CLs; EnQueue(U, N4).
Queue U: N4 (0, 0)
[Figure: the IUR-tree (root N4 over N1, N2, N3; objects p1–p5) with query q(0.5, 2.5) and the object text vectors.]
Example (cont'd), step 2: DeQueue(U, N4); EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1).
Mutual effects: (N1, N2), (N1, N3), (N2, N3).
Bounds: N4 (0, 0), N1 (0.37, 0.432), N3 (0.323, 0.619), N2 (0.21, 0.619).
Queue U: N3, N2; Pruned: N1.
Example (cont'd), step 3: DeQueue(U, N3); Answer.add(p4); Candidate.add(p5).
Mutual effects: (p4, p5), (p5, N2), (p4, N2).
Bounds: p4 (0.21, 0.619), p5 (0.374, 0.374).
Queue U: N2 (0.21, 0.619); Pruned: N1 (0.37, 0.432); Answer: p4; Candidate: p5.
Example (cont'd), step 4: DeQueue(U, N2); Answer.add(p2, p3); Pruned.add(p5).
Mutual effects: (p2, {p4, p5}), (p3, {p2, p4, p5}).
Pruned: N1 (0.37, 0.432), p5 (0.374, 0.374); Answer: p2, p3, p4.
Since U and Candidate are now both empty, the algorithm ends.
Results: p2, p3, p4.
Cluster IUR-tree: CIUR-tree
IUR-tree: Texts in an index node could be very different.
CIUR-tree: An enhanced IUR-tree by incorporating textual clusters.
[Figure: a CIUR-tree over p1–p5: each node additionally records the cluster histogram of its subtree (e.g. C1:1; C2:2; C1:1, C3:1) alongside its intersection and union vectors.]
Optimizations
• Motivation
  – Obtain tighter bounds during CIUR-tree traversal.
  – Purify the textual description in index nodes.
• Outlier detection and extraction (ODE-CIUR)
  – Extract subtrees with outlier clusters.
  – Treat the outliers separately and compute their bounds individually.
• Text-entropy-based optimization (TE-CIUR)
  – Define TextEntropy to describe the distribution of text clusters in a CIUR-tree entry.
  – Visit entries with higher TextEntropy first, i.e., those more diverse in text.
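A plausible reading of TextEntropy is the Shannon entropy of an entry's cluster distribution; the exact formula used by TE-CIUR may differ, so this is a sketch under that assumption.

```python
import math

def text_entropy(cluster_counts):
    """Shannon-entropy sketch of TextEntropy: the more evenly an entry's
    objects spread over text clusters, the higher the score.
    `cluster_counts` maps cluster id -> number of objects in the entry."""
    total = sum(cluster_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in cluster_counts.values() if c)
```

An entry split evenly over two clusters scores 1 bit, while a single-cluster entry scores 0, so visiting high-entropy entries first targets the textually mixed subtrees whose bounds are loosest.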
Experimental Study
• Experimental setup
  – OS: Windows XP; CPU: 2.0 GHz; Memory: 4 GB; Page size: 4 KB
  – Language: C/C++
• Compared methods
  – baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
• Datasets
  – ShopBranches (Shop), extended from a small real dataset
  – GeographicNames (GN), real data
  – CaliforniaDBpedia (CD), generated by combining locations in California with documents from DBpedia
• Metrics
  – Total query time
  – Number of page accesses

Statistics                       Shop       CD          GN
Total # of objects               304,008    1,555,209   1,868,821
Total unique words in dataset    3,933      21,578      222,409
Average # of words per object    45         47          4
Scalability
[Figure: total query time vs. dataset size (0.2K, 3K, 40K, 550K, 4M): (1) log-scale version, (2) linear-scale version.]
Effect of k
[Figure: (a) query time and (b) page accesses as k varies.]
Conclusion
• Propose a new query type, RSTkNN.
• Present a hybrid index, the IUR-tree.
• Present an efficient search algorithm to answer RSTkNN queries.
• Present an enhanced variant, the CIUR-tree, with two optimizations, ODE-CIUR and TE-CIUR, to further speed up search.
• Extensive experiments confirm the efficiency and scalability of the algorithms.
Reverse Spatial and Textual k
Nearest Neighbor Search
Thanks!
Q&A
A straightforward method
1. Compute RSkNN and RTkNN separately.
2. Combine the results of RSkNN and RTkNN to get the RSTkNN results.
However, there is no sensible way to combine the two result sets. (Infeasible)