kNNT_SSTD_2011 - Microsoft Research

advertisement
Retrieving k-Nearest Neighboring
Trajectories by a Set of Point Locations
Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan,
Xiao Yu, Jiawei Han
•University of Illinois at Urbana-Champaign
Microsoft Research Asia
Motivation: trajectory query by locations
User
Geo-tagged photos
Taxi trajectories
Check-ins
Huge volume of spatial trajectories
Require to search trajectories by a set of point locations
2
k-Nearest Neighboring trajectory query
q1
q2
q3
Query the top k trajectories with the minimum aggregated
distance to the given locations
The trajectories may not exactly pass those locations
3
k-NNT query
Task Definition:
Given the trajectory dataset D, and a set of query points, Q, the k-NNT query
retrieves k trajectories K from D, K = {R1, R2, …, Rk} that for ∀ Ri ∈ K, ∀ Rj ∈ D K, dist(Ri,Q) ≤ dist(Rj,Q).
Challenges
Huge trajectory dataset: High I/O cost to scan all the trajectories
Aggregated distance computation
Non-uniform distribution:
the trajectories are sparse/dense in different regions
the user-given query locations may be far from all the trajectories
4
The aggregate distance in k-NNT query
1. Find out the closest point from a trajectory to each query point
(i.e., shortest matching pairs)
3. Sum up the lengths of all matching pairs
R1
R2
p1,1
p2,1
p1,3
p1,2
q1
p1,4
q2
p2,3
p2,2
•dist(R1, q1)= dist(p1,2, q1)= 20 m
•dist(R1, q2)= dist(p1,3, q2)= 50 m
•dist(R1, q3)= dist(p1,5, q3)= 15 m
•dist(R1, Q)=∑ dist(R1, qi)= 85 m
p2,4
p2,5
p1,5
q3
p2,6
•dist(R2, q1)= dist(p2,3, q1)= 30 m
•dist(R2, q2)= dist(p2,4, q2)= 5 m
•dist(R2, q3)= dist(p2,6, q3)= 40 m
•dist(R2, Q)=∑ dist(R2, qi)= 75 m
5
Related Work: k-BCT query
k-Best Connected Trajectory (k-BCT) query [SIGMOD2010]
the similarity function between a trajectory R and query locations Q is
Problem: This function changes over units (inconsistent)
An example
If query Q has two points q1 and q2;
dist(R1, q1) = dist(R1, q2) = 2.4km = 1.48 miles,
dist(R2, q1) = 1.5 km =0.93 miles, dist(R2, q2) = 5km = 3.1 miles
Use unit “mile”, Sim(R1, Q) = 0.45 > Sim(R2, Q) = 0.43
Use unit “km”, Sim(R1, Q) = 0.18 < Sim(R2, Q) = 0.22
6
Advantages of k-NNT over k-BCT
The distance function of k-BCT changes over units (inconsistent)
The distance function of k-BCT is sensitive to a query
•k-BCT
•k-BCT&k-NNT
•k-NNT
q1
q2
q3
7
Query framework: candidate-generation-and-verification
Candidate generation
R1
Best-first search based
individual heaps
Coordination by a global
heap
R2 R3 R4
q1
Direct
Computing dist(R1, Q)= 5+2+2=9 m
dist(R2, Q)= 25+20+30=75m
dist(R3, Q)= 80+25+30=135m
dist(R4, Q)= 90+5+3=98 m
dist(R5, Q)= 55+8+70=123m
dist(R6, Q)= 120+80+40=240 m
R5
q2
q3
R6
Candidate verification
Candidate
Generation
Lower-bound estimation
R1
Efficient pruning with the
R
global heap
R4
q1
Candidate
Verification
5
q2
Outlier query location
Qualifier expectation based
method
dist(R1, Q)= 5+2+2=9 m
dist(R4, Q)= 90+5+3=98 m
dist(R5, Q)= 55+8+70=123m
q3
8
Candidate Generation
Given a query Q = {q1, q2, …, qm}, generate a trajectory
candidate set including all the k-NNTs (i.e., complete set)
Step 1: searching k-NN points using best-first-based individual
heap
Step 2: generating the candidate trajectories by the global heap
R1
<p2,3, q1>
<p5,2, q1>
<p1,6, q1>
<p2,9, q1>
…...
h1
<p6,2, q2>
<p5,3, q2>
<p7,4, q2>
<p4,8, q2>
…...
h2
<p2,2, q3>
<p3,5, q3>
<p7,3, q3>
<p8,6, q3>
…...
h3
R2 R3 R4
q1
R5
q2
R6
q3
9
Step 2: generating candidate trajectories
Global heap
A minimum heap sorting matching pairs by the distance
Retrieves new matching pair from individual heaps
Pops the matching pairs to the candidate set
Advantages
guarantee including all kNNTs in candidate set
generate compact candidate
sets
R1: <p1,2, q1>, <p1,5, q2>, <p1,3, q3>, ……, <p1,9, qm>.
R2:
, <p2,2, q2>, <p2,4, q3>, ……,
.
R4: <p4,5, q1>,
, <p4,3, q3>, ……, <p4,7, qm>
………...
Candidate Set
Global Heap (Size=m)
<p1,4, q1>, <p5,1, q3>, <p6,4, q4>, <p3,4, q2>, …...
h1
<p2,3, q1>
<p5,2, q1>
<p1,6, q1>
<p2,9, q1>
…...
h2
<p6,2, q2>
<p5,3, q2>
<p7,4, q2>
<p4,8, q2>
…...
hm
h3 Individual Heaps
<p2,2, q3>
<p5,1, qm>
<p3,5, q3>
<p2,3, qm>
<p7,3, q3>
<p5,7, qm>
…...
<p8,6, q3>
<p9,2, qm>
…...
…...
10
Example: Search based on the global heap
R1
R3
R2
R4
p1,2
q1
Candidate Set
R5
p5,5
q2
Global Heap
p4,4
p1,4
p4,5
h1
p1,6
q3
•<p1,2, q1>
……
h2
•<p1,4, q2>
……
h3
•<p1,6, q3>
……
Individual Heaps
11
Example: Search based on the global heap
R1
R3
R2
R4
•R1:
(Partial Match)
p1,2
q1
Candidate Set
R5
p5,5
q2
p4,4
p1,4
Global Heap
•<p1,4, q2>
•<p1,6, q3>
•<p1,2, q1>
p4,5
h1
p1,6
h2
q3
……
•<p5,5, q2>
……
h3
……
Individual Heaps
12
Example: Search based on the global heap
R1
R3
R2
R4
•R1:
•<p1,4, q2>
(Partial Match)
p1,2
q1
Candidate Set
R5
p5,5
q2
p4,4
p1,4
Global Heap
•<p1,6, q3>
•<p5,5, q2>
•<p1,2, q1>
p4,5
h1
p1,6
h2
h3
q3
……
……
•<p4,5, q3>
……
Individual Heaps
13
Example: Search based on the global heap
R1
R3
R2
R4
p1,2
q1
•R1:
•<p1,4, q2>•<p1,6, q3> (Partial Match)
•R5:
(Partial Match)
Candidate Set
R5
p5,5
q2
p4,4
p1,4
Global Heap
•<p5,5, q2>
•<p4,5, q3>
•<p1,2, q1>
p4,5
h1
p1,6
q3
h2
•<p4,4, q2>
……
h3
……
……
Individual Heaps
14
Example: Search based on the global heap
R1
R3
R2
R4
R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)
R4:
p1,2
R5:
q1
<p4,5, q3>.
<p5,5, q2>.
(Partial Match)
(Partial Match)
Candidate Set
R5
p5,5
q2
Global Heap
<p1,2, q1>, <p4,4, q2>, <p1,5, q3>
p4,4
p1,4
p4,5
h1
p1,6
q3
h2
……
h3
……
……
Stop critiria: when there is k full-matching
candidates – Property 1: The candidate set is
complete if G has popped out k full-matching
candidates
(InHeaps
this example k=1)
Individual
15
Candidate verification
R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)
R4:
R5:
<p4,5, q3>.
<p5,5, q2>.
Candidate Set
(Partial Match)
(Partial Match)
The full-matching candidate may not be the final k-NNT
The system has to retrieve the partial-matching trajectories
(R4 and R5) to compute their aggregate distance (I/O cost)
Question: can we compute a lower-bound for R4 and R5
without retrieving their details?
If LB(R4/5) > dist(R1,Q), we can prune it directly
16
Candidate verification
The lower-bound of a partial-matching trajectory is
If the LB(R) is larger than the distance of full-matching
candidate, R can be pruned directly
R1: <p1,2, q1> <p1,4, q2> <p1,6, q3> dist(R1) = 95
<p4,5, q3> LB(R4) =114 (pruned)
R4:
R5:
Candidate Set
<p5,5, q2>
•<p1,2, q1>
•<p4,4, q2>
LB(R5) =90 (passed)
•<p1,5, q3>
Global Heap
17
Problem of Outlier Query Location
A query location is an outlier if it is far from all the trajectories
Too many partial-matching candidates will be generated before
finding a full-matching candidates
R1
R2 p2,1
R3
q1
p1,2
R4
Candidate Set
R1: <p1,1, q1>, <p1,4, q2>, . (Partial Matching)
R2: <p2,1, q1>, <p2,5, q2>, . (Partial Matching)
R4:
, <p4,4, q2>, . (Partial Matching)
p2,2
<p1,7, q3> cannot
be popped out
q2
p1,4
p2,5
Global Heap
<p2,5, q2>, <p2,1, q1>, <p1,7, q3> Iteration 1
p4,4
p1,7
p2,6
<p2,1, q1>, <p1,4, q2>, <p1,7, q3> Iteration 2
<p1,4, q2>, <p1,1, q1>, <p1,7, q3> Iteration 3
q3
<p1,1, q1>, <p4,4, q2>, <p1,7, q3> Iteration 4
……
18
Qualifier expectation based method
The system can make up the missing pairs of a partialmatching trajectory by retrieving all its points
Two key issues:
Guarantee the completeness of candidate set
Property 2: If there are k made-up candidates (qualifier) with distance smaller
than the sum of the pairs in global heap, the candidate set is complete
Which candidate should be selected to make up?
The qualifier expectation measure
19
Example of Qualifier Expectation
R1
R2 p2,1
R3
p2,2
q2
R1: 40m. R2: 30m. R4: 15m.
p1,4
p2,5
p4,4
p1,7
q3
dist(R1) =160m < sum(G), R1 is a qualifier
•R1: <p1,1, q1>, <p1,4, q2>, <p1,7, q3>.
q1
p1,2
R4
p2,6
Qualifier Expectation
R1: <p1,1, q1>, <p1,4, q2>,
.
R2: <p2,1, q1>, <p2,5, q2>,
.
R4:
.
,<p4,4, q2>,
Candidate Set
Global Heap, total dist sum(G) = 200m
<p2,1, q1>, <p4,4, q2>, <p1,7, q3>
20
Experiment Setup
Real Dataset: collected from the
Microsoft Geolife and T-Drive
projects , with over 20,000 real
trajectories
Synthetic datasets with both
uniform distribution and biased
distribution
Random generated query Q
The proposed methods are
compared with Fagin’s
Algorithm (FA) and Threshold
Algorithm (TA) (used in kBCT)
21
Evaluations on synthetic dataset (biased distribution)
GH (global heap) is faster than baselines with less I/O costs
QE( global heap+ qualifier expectation ) is an order of magnitude faster than
others
GH
100000
QE
100000
TA
10000000
Time (unit: ms)
10000
1000000
1000
100000
FA
Accessed Rtree Nodes
10000
1000
10000
100
3k
6k
100
9k
12k
1000
2
4
6
8
(a) Query Time vs. |Q|
10
2
4
6
8
10
(b) I/O Cost vs. |Q|
22
Evaluations on real dataset
When |Q| is small, the probability of outlier location is low, GH achieves the
best performance
When |Q| is larger, the probability of outlier location is high, QE is more
efficient
GH
10000
QE
100000
TA
1000000
Time (unit: ms)
FA
Accessed Rtree Nodes
10000
1000
100000
1000
100
10000
100
3k
6k
10
9k
12k
1000
2
4
6
8
(a) Query Time vs. |Q|
10
2
4
6
8
10
(b) I/O Cost vs. |Q|
23
Conclusion
k-Nearest Neighboring Trajectory (k-NNT) query
retrieve trajectories by a set of locations
Candidate-generation-and-verification framework
Generate candidate trajectories with global heap
Efficient lower-bound computation
Outlier query location: qualifier expectation based method
24
Thanks very much!
Any Questions?
25
Download