Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations
Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han
University of Illinois at Urbana-Champaign; Microsoft Research Asia

Motivation: trajectory query by locations
• User geo-tagged photos, taxi trajectories, check-ins
• Huge volume of spatial trajectories
• Need to search trajectories by a set of point locations

k-Nearest Neighboring trajectory query
[Figure: trajectories around the query points q1, q2, q3]
• Query the top k trajectories with the minimum aggregated distance to the given locations
• The trajectories may not pass exactly through those locations

k-NNT query
Task definition: Given the trajectory dataset D and a set of query points Q, the k-NNT query retrieves a set K of k trajectories from D, K = {R1, R2, …, Rk}, such that for ∀ Ri ∈ K, ∀ Rj ∈ D − K, dist(Ri, Q) ≤ dist(Rj, Q).
Challenges:
• Huge trajectory dataset: high I/O cost to scan all the trajectories
• Aggregated distance computation
• Non-uniform distribution: the trajectories are sparse in some regions and dense in others, and the user-given query locations may be far from all the trajectories

The aggregate distance in k-NNT query
1. Find the closest point of a trajectory to each query point (i.e., the shortest matching pairs)
2. Sum up the lengths of all matching pairs
[Figure: trajectories R1 (points p1,1 to p1,5) and R2 (points p2,1 to p2,6) with query points q1, q2, q3]
• dist(R1, q1) = dist(p1,2, q1) = 20 m
• dist(R1, q2) = dist(p1,3, q2) = 50 m
• dist(R1, q3) = dist(p1,5, q3) = 15 m
• dist(R1, Q) = ∑ dist(R1, qi) = 85 m
• dist(R2, q1) = dist(p2,3, q1) = 30 m
• dist(R2, q2) = dist(p2,4, q2) = 5 m
• dist(R2, q3) = dist(p2,6, q3) = 40 m
• dist(R2, Q) = ∑ dist(R2, qi) = 75 m

Related Work: k-BCT query
• k-Best Connected Trajectory (k-BCT) query [SIGMOD2010]
• The similarity function between a trajectory R and query locations Q is Sim(R, Q) = ∑ e^(−dist(R, qi)), summed over all qi ∈ Q
• Problem: the ranking produced by this function changes with the distance unit (inconsistent)
• An example: query Q has two points q1 and q2;
  dist(R1, q1) = dist(R1, q2) = 2.4 km ≈ 1.48 miles,
  dist(R2, q1) = 1.5 km ≈ 0.93 miles, dist(R2, q2) = 5 km ≈ 3.1 miles
  – With unit "mile": Sim(R1, Q) = 0.45 > Sim(R2, Q) = 0.43
  – With unit "km": Sim(R1, Q) = 0.18 < Sim(R2, Q) = 0.22

Advantages of k-NNT over k-BCT
• The distance function of k-BCT changes over units (inconsistent)
• The distance function of k-BCT is sensitive to a query
[Figure: query points q1, q2, q3; legend: k-BCT, k-BCT & k-NNT, k-NNT]

Query framework: candidate-generation-and-verification
Direct computing (all trajectories R1 to R6, query points q1, q2, q3):
• dist(R1, Q) = 5 + 2 + 2 = 9 m
• dist(R2, Q) = 25 + 20 + 30 = 75 m
• dist(R3, Q) = 80 + 25 + 30 = 135 m
• dist(R4, Q) = 90 + 5 + 3 = 98 m
• dist(R5, Q) = 55 + 8 + 70 = 133 m
• dist(R6, Q) = 120 + 80 + 40 = 240 m
Candidate generation:
• Best-first search based individual heaps
• Coordination by a global heap
Candidate verification (only R1, R4 and R5 remain):
• Lower-bound estimation
• Efficient pruning with the global heap
• Outlier query location: qualifier expectation based method
• dist(R1, Q) = 5 + 2 + 2 = 9 m
• dist(R4, Q) = 90 + 5 + 3 = 98 m
• dist(R5, Q) = 55 + 8 + 70 = 133 m

Candidate Generation
Given a query Q = {q1, q2, …, qm}, generate a trajectory candidate set that includes all the k-NNTs (i.e., a complete set)
• Step 1: search the k-NN points with a best-first-based individual heap
• Step 2: generate the candidate trajectories with the global heap
[Figure: trajectories R1 to R6, query points q1 to q3, and the individual heaps]
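
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the best-first search over a spatial index is replaced by a brute-force sorted stream, trajectories are small in-memory lists of 2-D points, and the names `nearest_point_stream` and `generate_candidates` are hypothetical. The loop stops once k trajectories are fully matched, which is the stop criterion stated later in the talk.

```python
import heapq
import math


def nearest_point_stream(points, q):
    """Step 1 (stand-in): yield (distance, traj_id, point) in ascending
    distance from q. A real system would use best-first search on an
    index; sorting everything up front is an assumption made for brevity."""
    yield from sorted((math.dist(p, q), tid, p) for tid, p in points)


def generate_candidates(trajectories, queries, k):
    """Step 2: coordinate the individual streams with a global min-heap.

    Returns {traj_id: {query_index: distance}}; a trajectory is a full
    match when it has one entry per query point."""
    points = [(tid, p) for tid, pts in trajectories.items() for p in pts]
    streams = [nearest_point_stream(points, q) for q in queries]

    # The global heap holds one matching pair per query point (size m).
    global_heap = []
    for qi, stream in enumerate(streams):
        first = next(stream, None)
        if first is not None:
            d, tid, p = first
            heapq.heappush(global_heap, (d, qi, tid, p))

    candidates = {}
    full_matches = 0
    while global_heap and full_matches < k:
        d, qi, tid, p = heapq.heappop(global_heap)
        matched = candidates.setdefault(tid, {})
        if qi not in matched:  # first pop is the shortest matching pair
            matched[qi] = d
            if len(matched) == len(queries):
                full_matches += 1
        # Refill the global heap from the individual stream of qi.
        nxt = next(streams[qi], None)
        if nxt is not None:
            nd, ntid, npt = nxt
            heapq.heappush(global_heap, (nd, qi, ntid, npt))
    return candidates


# Hypothetical toy data: three trajectories, two query points.
toy = {"R1": [(0, 1), (10, 1)], "R2": [(0, 3), (10, 5)], "R3": [(5, 0)]}
print(generate_candidates(toy, [(0, 0), (10, 0)], k=1))
```

With k = 1 the search stops as soon as R1 is fully matched, so no other trajectory enters the candidate set; with k = 2 the set also contains the partial match R3, which would still need verification.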

Step 2: generating candidate trajectories
Global heap:
• A minimum heap sorting matching pairs by distance
• Retrieves new matching pairs from the individual heaps
• Pops the matching pairs to the candidate set
Advantages:
• Guarantees that all k-NNTs are included in the candidate set
• Generates compact candidate sets
Candidate set:
  R1: <p1,2, q1>, <p1,5, q2>, <p1,3, q3>, ……, <p1,9, qm>
  R2: (none), <p2,2, q2>, <p2,4, q3>, ……, (none)
  R4: <p4,5, q1>, (none), <p4,3, q3>, ……, <p4,7, qm>
  ……
Global heap (size = m): <p1,4, q1>, <p5,1, q3>, <p6,4, q4>, <p3,4, q2>, ……
Individual heaps:
  h1: <p2,3, q1>, <p5,2, q1>, <p1,6, q1>, <p2,9, q1>, ……
  h2: <p6,2, q2>, <p5,3, q2>, <p7,4, q2>, <p4,8, q2>, ……
  h3: <p2,2, q3>, <p3,5, q3>, <p7,3, q3>, <p8,6, q3>, ……
  hm: <p5,1, qm>, <p2,3, qm>, <p5,7, qm>, <p9,2, qm>, ……

Example: Search based on the global heap
[Figure: trajectories R1 to R5 with points p1,2, p1,4, p1,6, p4,4, p4,5, p5,5 near query points q1, q2, q3]
Snapshot 1:
  Candidate set: (empty)
  Global heap: (empty)
  Individual heaps: h1: <p1,2, q1>, ……; h2: <p1,4, q2>, ……; h3: <p1,6, q3>, ……
Snapshot 2:
  Candidate set: R1: (partial match)
  Global heap: <p1,4, q2>, <p1,6, q3>, <p1,2, q1>
  Individual heaps: h1: ……; h2: <p5,5, q2>, ……; h3: ……
Snapshot 3:
  Candidate set: R1: <p1,4, q2> (partial match)
  Global heap: <p1,6, q3>, <p5,5, q2>, <p1,2, q1>
  Individual heaps: h1: ……; h2: ……; h3: <p4,5, q3>, ……
Snapshot 4:
  Candidate set: R1: <p1,4, q2>, <p1,6, q3> (partial match); R5: (partial match)
  Global heap: <p5,5, q2>, <p4,5, q3>, <p1,2, q1>
  Individual heaps: h1: ……; h2: <p4,4, q2>, ……; h3: ……
Snapshot 5:
  Candidate set:
    R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3> (full match)
    R4: <p4,5, q3> (partial match)
    R5: <p5,5, q2> (partial match)
  Global heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3>
  Individual heaps: h1: ……; h2: ……; h3: ……
Stop criterion: the search stops when there are k full-matching candidates
– Property 1: The candidate set is complete if the global heap G has popped out k full-matching candidates (in this example, k = 1)

Candidate verification
Candidate set:
  R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3> (full match)
  R4: <p4,5, q3> (partial match)
  R5: <p5,5, q2> (partial match)
• A full-matching candidate may not be a final k-NNT
• The system has to retrieve the partial-matching trajectories (R4 and R5) to compute their aggregate distances (I/O cost)
• Question: can we compute a lower bound for R4 and R5 without retrieving their details?
• If LB(R4) or LB(R5) is larger than dist(R1, Q), that trajectory can be pruned directly

Candidate verification (continued)
• The lower bound LB(R) of a partial-matching trajectory R is: [equation]
• If LB(R) is larger than the distance of the full-matching candidate, R can be pruned directly
Candidate set:
  R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>; dist(R1) = 95
  R4: <p4,5, q3>; LB(R4) = 114 (pruned)
  R5: <p5,5, q2>; LB(R5) = 90 (passed)
Global heap: <p1,2, q1>, <p4,4, q2>, <p1,5, q3>

Problem of Outlier Query Location
• A query location is an outlier if it is far from all the trajectories
• Too many partial-matching candidates are generated before a full-matching candidate is found
[Figure: trajectories R1 to R4 and query points q1, q2, q3]
Candidate set:
  R1: <p1,1, q1>, <p1,4, q2>, (none) (partial matching)
  R2: <p2,1, q1>, <p2,5, q2>, (none) (partial matching)
  R4: (none), <p4,4, q2>, (none) (partial matching)
Global heap:
  Iteration 1: <p2,5, q2>, <p2,1, q1>, <p1,7, q3>
  Iteration 2: <p2,1, q1>, <p1,4, q2>, <p1,7, q3>
  Iteration 3: <p1,4, q2>, <p1,1, q1>, <p1,7, q3>
  Iteration 4: <p1,1, q1>, <p4,4, q2>, <p1,7, q3>
  ……
• <p1,7, q3> cannot be popped out

Qualifier expectation based method
• The system can make up the missing pairs of a partial-matching trajectory by retrieving all its points
• Two key issues:
  1. Guaranteeing the completeness of the candidate set
     – Property 2: If there are k made-up candidates (qualifiers) with distance smaller than the sum of the pairs in the global heap, the candidate set is complete
  2. Which candidate should be selected to make up?
     – The qualifier expectation measure

Example of Qualifier Expectation
[Figure: trajectories R1 to R4 and query points q1, q2, q3]
Candidate set:
  R1: <p1,1, q1>, <p1,4, q2>, (none)
  R2: <p2,1, q1>, <p2,5, q2>, (none)
  R4: (none), <p4,4, q2>, (none)
Global heap, total distance sum(G) = 200 m: <p2,1, q1>, <p4,4, q2>, <p1,7, q3>
Qualifier expectation: R1: 40 m; R2: 30 m; R4: 15 m
• R1 is made up: <p1,1, q1>, <p1,4, q2>, <p1,7, q3>
• dist(R1) = 160 m < sum(G), so R1 is a qualifier

Experiment Setup
• Real dataset: collected from the Microsoft GeoLife and T-Drive projects, with over 20,000 real trajectories
• Synthetic datasets with both uniform and biased distributions
• Randomly generated queries Q
• The proposed methods are compared with Fagin's Algorithm (FA) and the Threshold Algorithm (TA) (used in k-BCT)

Evaluations on synthetic dataset (biased distribution)
• GH (global heap) is faster than the baselines, with lower I/O cost
• QE (global heap + qualifier expectation) is an order of magnitude faster than the others
[Figure: (a) Query time (ms) vs. |Q|; (b) I/O cost (accessed R-tree nodes) vs. |Q|; methods GH, QE, TA, FA; |Q| from 2 to 10]

Evaluations on real dataset
• When |Q| is small, the probability of an outlier location is low, and GH achieves the best performance
• When |Q| is larger, the probability of an outlier location is high, and QE is more efficient
[Figure: (a) Query time (ms) vs. |Q|; (b) I/O cost (accessed R-tree nodes) vs. |Q|; methods GH, QE, TA, FA; |Q| from 2 to 10]

Conclusion
• The k-Nearest Neighboring Trajectory (k-NNT) query retrieves trajectories by a set of locations
• Candidate-generation-and-verification framework
  – Generate candidate trajectories with the global heap
  – Efficient lower-bound computation
  – Outlier query location: qualifier expectation based method

Thanks very much! Any questions?
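
Supplementary sketch for the lower-bound pruning named in the conclusion. The exact lower-bound equation is not preserved in this text, so the code rests on an assumption: for every query point a trajectory has not yet matched, the distance of the pair currently in the global heap for that query point is used, since any unseen point can only be at least that far away. The function names and the individual pair distances are hypothetical; the distances were chosen only so that the totals reproduce the values from the verification example (dist(R1) = 95, LB(R4) = 114, LB(R5) = 90).

```python
def lower_bound(matched, heap_dist):
    """Assumed lower bound: matched distances where available, otherwise
    the distance of the global-heap pair for that query point."""
    return sum(matched.get(q, heap_dist[q]) for q in heap_dist)


def split_by_lower_bound(partial, heap_dist, threshold):
    """Separate partial-matching candidates into pruned and passed.

    threshold is the aggregate distance of the current k-th best
    full-matching candidate."""
    pruned, passed = [], []
    for tid, matched in partial.items():
        if lower_bound(matched, heap_dist) > threshold:
            pruned.append(tid)  # cannot beat the full-matching candidate
        else:
            passed.append(tid)  # must be retrieved and computed exactly
    return pruned, passed


# Hypothetical pair distances (not given in the talk).
heap_dist = {"q1": 50.0, "q2": 34.0, "q3": 30.0}
partial = {"R4": {"q3": 30.0}, "R5": {"q2": 10.0}}

pruned, passed = split_by_lower_bound(partial, heap_dist, threshold=95.0)
print(pruned, passed)
```

Only the candidates in `passed` need their full point lists read from storage, which is where the I/O saving comes from.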