Similarity-based Analysis for Trajectory Data Kevin Zheng 25/04/2014 DASFAA 2014 Tutorial 1 Outline • Background – What is trajectory – Where do they come from – Why are they useful – Characteristics • Trajectory similarity search – Query classification – Trajectory similarity measures – Trajectory index • Similarity-based trajectory mining – Popular route mining – Co-traveller discovery – Trajectory clustering 25/04/2014 DASFAA 2014 Tutorial 2 Outline • Background – What is trajectory – Where do they come from – Why are they useful – Characteristics • Trajectory similarity search – Query classification – Trajectory similarity measures – Trajectory index • Similarity-based trajectory mining – Popular route mining – Co-traveller discovery – Trajectory clustering 25/04/2014 DASFAA 2014 Tutorial 3 What is trajectory? • Historical location records of moving objects • In mathematics – Continuous function: time à location – Location can be any dimension • In real applications – Locations are sampled periodically – A finite sequence of time-stamped locations: <p1, t1>, <p2, t2> …, <pn, tn> – p: two or three dimensions (longitude, latitude) 25/04/2014 DASFAA 2014 Tutorial 4 Where is it from? 25/04/2014 DASFAA 2014 Tutorial 5 Where is it from? • GPS module on moving objects – Vehicles, mobile phone users, animals • Online social network – Twitter, Flickr, Facebook, Weibo • Sensors – Surveillance cameras, RFID, WiFi • More … 25/04/2014 DASFAA 2014 Tutorial 6 Who cares about it? • Government – Traffic pattern analysis – Public transportation management – Urban planning • Business – Location-based service – Personalized advertisement & recommendation – Taxi company, logistic company • Scientists & Researchers – Zoologist, meteorologist, astronomer – Open problems, challenging tasks • More … 25/04/2014 DASFAA 2014 Tutorial 7 Trajectory data are BIG • Volume • Velocity • Variety 25/04/2014 DASFAA 2014 Tutorial 8 Volume • In 2010, 1 billion vehicles – Taxi, logistic companies keep tracking their vehicles – Self-driving car in near future? • In 2012, 1.08 billion smartphone users • In 2013, 20 million surveillance cameras in China • They are generator! – The data keep accumulated 25/04/2014 DASFAA 2014 Tutorial 9 Velocity • Not just huge, they’re being generated quickly • Vehicle tracking & navigation – Re-position every few seconds • Geo-tagged social media – 2 million Flickr photos per day, 5% geo-tagged – 100 million posts on Sina Weibo per day, 1-2% geo-tagged – 400 million tweets per day, 1% geo-tagged • Sensors – How many cars pass a road camera every day? 25/04/2014 DASFAA 2014 Tutorial 10 Geo-tagged tweets Images courtesy of Twitter 25/04/2014 DASFAA 2014 Tutorial 11 Variety • Data source • Tracking devices – Car GPS, smartphones, sensors • Tracking methods – Sampling strategy, sampling rate, • Spatial length & temporal duration • Data quality 25/04/2014 DASFAA 2014 Tutorial 12 Research directions • • • • • Scalable, real-time data processing Flexible database storage and index Effective similarity measures Uncertainty management Data compression Key and fundamental research problem: similarity-based analysis 25/04/2014 DASFAA 2014 Tutorial 13 Outline • Background – What is trajectory – Where do they come from – Why are they useful – Characteristics • Trajectory similarity search – Query classification – Trajectory similarity measures – Trajectory index • Similarity-based trajectory mining – Popular route mining – Co-traveller discovery – Trajectory clustering 25/04/2014 DASFAA 2014 Tutorial 14 Similarity-based analysis for trajectories • Core problem: trajectory similarity search – Input: a trajectory dataset D, a query Q – Output: a subset of D that are ‘similar’ to Q • Foundation – Trajectory similarity measures • Approach – Index and search algorithm • Application – Popular route mining (route recommendation) – co-traveller discovery, clustering, classification, etc… 25/04/2014 DASFAA 2014 Tutorial 15 Similarity query classification • P-query – Query: point(s) • R-query – Query: region (spatial & temporal dimension) • T-query – Query: trajectory 25/04/2014 DASFAA 2014 Tutorial 16 P-query (single point) Query location: q Temporal constraint (optional): tc = [ts, te] ts te q ts 𝐷(𝑞, 𝑇)=𝑚𝑖𝑛𝑑𝑖𝑠𝑡(𝑞,𝑝) 𝑝∈𝑇 and satisfy tc dist(q,p): - Lp-norm - Network distance te [Tao2002] Tao Y., Papadias D. and Shen Q., Continuous nearest neighbour search, VLDB, 2002 25/04/2014 DASFAA 2014 Tutorial 17 P-query (multiple points) q1 q2 Query locations Q: q1, q2, q3, q4 q3 D(Q,T) is an aggregate function of D(q,T) q4 [Chen2010] Chen Z., Shen HT., Zhou X., Zheng Y and Xie X., Searching trajectories by locations – an efficiency study. SIGMOD 2010 25/04/2014 DASFAA 2014 Tutorial 18 R-query • Spatial region: R • Temporal interval:[ts, te] ts Ask for trajectories in a given region during a time interval R te ts te [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 25/04/2014 DASFAA 2014 Tutorial 19 T-query • Query: Tq How to measure their distance? Tq 25/04/2014 DASFAA 2014 Tutorial 20 Trajectory similarity measures • • • • • • Many-to-many mapping Different semantic/applications Different lengths Different sampling rates Noises Temporal dimension? 25/04/2014 DASFAA 2014 Tutorial 21 Classification Consider location only Based on location samples Discrete Continuous Consider both location and time Spatial-only Spatial-temporal Lp-norm DTW LCSS EDR DTW, LCSS, EDR with time constrain OWD LIP Synchronous Euclidean Distance Based on line segments or curves 25/04/2014 DASFAA 2014 Tutorial 22 Classification Spatial-only Discrete Continuous 25/04/2014 Spatial-temporal Lp-norm DTW LCSS EDR DTW, LCSS, EDR with time constrain OWD LIP Synchronous Euclidean Distance DASFAA 2014 Tutorial 23 Lp-norm • Average Lp-norm distance of all matched locations • 1-to-1 mapping • Trajectories are of the same length 25/04/2014 DASFAA 2014 Tutorial 24 Lp-norm • Cannot detect similar trajectories with different sampling rates • Sensitive to noise 25/04/2014 DASFAA 2014 Tutorial 25 DTW • Dynamic Time Warping distance – Adaptation from time series distance measure – Used to handle time shift and scale in time series • Optimal order-aware alignment between two sequences – Goal: minimize the aggregate distance between matched points • 1-to-many mapping Yi, Byoung-Kee, Jagadish, HV and Faloutsos, Christos, Efficient retrieval of similar time sequences under time warping. ICDE 1998 25/04/2014 DASFAA 2014 Tutorial 26 DTW for trajectories • Nothing to do with ‘time’ at all • Useful when detecting similar trajectories with different sampling rates • Sensitive to noise 25/04/2014 DASFAA 2014 Tutorial 27 LCSS • Longest Common Sub-Sequence • Adaptation of string similarity – Lcss(‘abcde’,’bd’) = 2 • Threshold-based equality relationship – Two locations are regarded as equal if they’re ‘close’ (compared to a threshold) • 1-to-(1 or null) mapping VLACHOS, M., GUNOPULOS, D., AND KOLLIOS, G. Discovering similar multidimensional trajectories. ICDE 2002 25/04/2014 DASFAA 2014 Tutorial 28 LCSS • Insensitive to noise • Not easy to define threshold • May return dissimilar trajectories p5 p3 p4 p2 p1 25/04/2014 p’1 p’3 p’2 DASFAA 2014 Tutorial 29 EDR • Edit Distance on Real sequence • Adaptation from Edit Distance on strings – Number of insert, delete, replace needed to convert A into B • Threshold-based equality relationship – Two locations are regarded as equal if they’re ‘close’ (compared to a threshold) Lei Chen, M. Tamer Ozsu, Vincent Oria, Robust and Fast Similarity Search for Moving Object Trajectories. SIGMOD 2005 25/04/2014 DASFAA 2014 Tutorial 30 EDR • Value means the number of operations, not “distance between locations” – Insensitive to noise replace p5 insert p3 p4 p2 insert 25/04/2014 p1 p’1 p’3 p’2 DASFAA 2014 Tutorial 31 LCSS and EDR • They are both count-based – LCSS counts the number of matched pairs – EDR counts the cost of operations needed to fix the unmatched pairs • Higher LCSS, lower EDR • If cost(replace) = cost(insert) + cost(delete): • EDR(X,Y) = L(X)+L(Y) – 2LCSS(X,Y) 25/04/2014 DASFAA 2014 Tutorial 32 Classification Spatial-only Discrete Continuous 25/04/2014 Spatial-temporal Lp-norm DTW LCSS EDR DTW, LCSS, EDR with time constrain OWD LIP Synchronous Euclidean Distance DASFAA 2014 Tutorial 33 OWD • One Way Distance from T1 to T2 is: – Integral of the distance from points of T1 to T2 – Divided by the length of T1 • Make it into symmetric measure Bin Lin, Jianwen Su, One Way Distance: For Shape Based Similarity Search of Moving Object Trajectories. In Geoinformatica (2008) 25/04/2014 DASFAA 2014 Tutorial 34 OWD example • Consider one trajectory as piece-wise line segment, and the other as discrete samples 25/04/2014 DASFAA 2014 Tutorial 35 LIP distance • Locality In-between Polylines – Polygon is the set of polygons formed between intersection points – Nikos Pelekis et al, Similarity Search in Trajectory Databases. Symposium on Temporal Representation and Reasoning 2007 25/04/2014 DASFAA 2014 Tutorial 36 LIP distance • Only work for 2-dimensional trajectories • Polygon à polyhedron: non-trivial change 25/04/2014 DASFAA 2014 Tutorial 37 Spatial-temporal similarity measure • All the spatial-only similarity measures can incorporate time information in trajectories • Lp-norm and DTW can apply a temporal constrain for more synchronized alignment • EDR and LCSS can apply a temporal threshold on top of spatial threshold 25/04/2014 DASFAA 2014 Tutorial 38 Classification Spatial-only Discrete Continuous 25/04/2014 Spatial-temporal Lp-norm DTW LCSS EDR DTW, LCSS, EDR with time constrain OWD LIP Synchronous Euclidean Distance DASFAA 2014 Tutorial 39 DTW with temporal constrain Time tolerance = 2 13 15 10 9 7 10 25/04/2014 DASFAA 2014 Tutorial 40 Spatial-temporal LCSS • A temporal threshold controls how far in time we can go in order to match two locations p5 t3=4 p3 t4=7 p4 t2=2 p2 t1=1 p1 p’1 t’1=1 t5=11 Time threshold=2 p’3 t’3=4 p’2 t’2=3 VLACHOS, M., GUNOPULOS, D., AND KOLLIOS, G. Discovering similar multidimensional trajectories. ICDE 2002 25/04/2014 DASFAA 2014 Tutorial 41 Classification Spatial-only Discrete Continuous 25/04/2014 Spatial-temporal Lp-norm DTW LCSS EDR DTW, LCSS, EDR with time constrain OWD LIP Synchronous Euclidean Distance DASFAA 2014 Tutorial 42 SED • Synchronous Euclidean Distance – Euclidean distance between locations at the same time instance of two trajectories • Regard trajectory as continuous function of time Mirco Nanni, Dino Pedreschi, Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems (2006) POTAMIAS, M., PATROUMPAS, K., AND SELLIS, T. K. Sampling trajectory streams with spatiotemporal criteria. SSDBM 2006 25/04/2014 DASFAA 2014 Tutorial 43 SED Virtually create a sample point at t=3 t=10 t=0 t=20 t=15 t=5 t=10 t=3 t=0 25/04/2014 t=25 t=12 t=7 DASFAA 2014 Tutorial 44 Trajectory index • Similarity measures give the way to calculate the distance between two trajectories • However … – Huge amount of trajectories – Linear scan is inefficient 25/04/2014 DASFAA 2014 Tutorial 45 Trajectory index • • • • 3D R-tree STR-tree (Spatio-temporal R-tree) TB-tree (Trajecoty Bundle) Multi-version R-tree – Partition temporal dimension • Grid-based index – Partition spatial dimension 25/04/2014 DASFAA 2014 Tutorial 46 3D R-tree • Indexing position samples only – Cannot answer queries about movements inbetween those samples • Indexing line segments Time y x 25/04/2014 DASFAA 2014 Tutorial 47 Problem of R-tree • Large “dead space” – MBB covers a large portion of the space with no data – Low pruning power • Trajectory preservation – Line segments are grouped merely by spatial proximity – Regardless which trajectory they belong to – Retrieving a trajectory requires visits of different paths in the tree 25/04/2014 DASFAA 2014 Tutorial 48 Augmented 3D R-tree • Augment leaf node with orientation information • Distance to line segments can be approximated more accurately q [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 25/04/2014 DASFAA 2014 Tutorial 49 STR-tree • Extension of augmented 3D R-tree • Insertion – Try to keep the line segments belonging to the same trajectory together – Find the leaf node containing the predecessor • Node split – Put disconnected segments into new node – Put the most recent backward-connected segment into new node [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 25/04/2014 DASFAA 2014 Tutorial 50 TB-tree (Trajectory Bundle) • A leaf node only contains segments belonging to the same trajectory • Leaf nodes containing same trajectory are linked • Strictly preserve trajectories [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 25/04/2014 DASFAA 2014 Tutorial 51 TB-tree (Trajectory Bundle) 25/04/2014 DASFAA 2014 Tutorial 52 Reflection • Previous index treat spatial and temporal dimensions equally – 3D tree structure • In trajectory database, temporal dimension is more dynamic – New segments are appended to existing trajectories – Archived trajectories rarely update • Indexing spatial and temporal dimensions separately – Partition temporal dimension first – Partition spatial dimension first 25/04/2014 DASFAA 2014 Tutorial 53 Multi-version R-tree For each 2mestamp, an R-­‐tree is created. So, there are many R-­‐trees. These R-­‐trees are indexed. HR-­‐tree Query for trajectories in a given region and in a given time interval: 1. The R-tree at the timestamp is found first 2. The trajectories in the specified region are retrieved from the R-tree. [Nascimento1998] Nascimento, M., Silva, J. Towards Historical R-trees. ACM SAC, 1998 [Tao2001a] Tao, Y., Papadias, D.: Efficient historical r-trees. In: ssdbm, p. 0223. Published by the IEEE Computer Society (2001) [Xu2005]Xu, X., Han, J., Lu, W.: Rt-tree: An improved r-tree indexing structure for temporal spatial databases. In: Int. Symp. on Spatial Data Handling, 2005 [Tao2001b] Tao, Y., Papadias, D.: Mv3r-tree: A spatio-temporal access method for timestamp and interval queries. In: VLDB, pp. 431-440 (2001) 25/04/2014 DASFAA 2014 Tutorial 54 Grid-based index • Partition space into non-overlapping cells • Trajectory segments in each cell are indexed on temporal dimension • Query processing: – Spatial filtering – Temporal filtering [Prasad2003] V. Prasad Chakka Adam C. Everspaugh Jignesh M., Patel, Indexing Large Trajectory Data Sets With SETI, CIDR 2003 25/04/2014 DASFAA 2014 Tutorial 55 Grid-based index [Prasad2003] V. Prasad Chakka Adam C. Everspaugh Jignesh M., Patel, Indexing Large Trajectory Data Sets With SETI, CIDR 2003 25/04/2014 DASFAA 2014 Tutorial 56 Outline • Background – What is trajectory – Where do they come from – Why are they useful – Characteristics • Trajectory similarity search – Query classification – Trajectory similarity measures – Trajectory index • Similarity-based trajectory mining – Popular route mining – Co-traveller discovery – Trajectory clustering 25/04/2014 DASFAA 2014 Tutorial 57 Popular route mining • Shortest path may not be the favourable • Find the most popular/desirable path using the GPS trajectories of past travellers • Classification – No specific source/destination – hot route discovery – Specific source/destination – popular route search 25/04/2014 DASFAA 2014 Tutorial 58 T-pattern • A set of individual trajectories that share the property of visiting the same sequence of places with similar travel times 1. Spatial discretization: discretize space into finite set of regions of interest (RoI) 2. Translate trajectories into sequence of RoIs 3. Adapt sequential pattern mining algorithms with time constrain F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In SIGKDD, pages 330–339, 2007 25/04/2014 DASFAA 2014 Tutorial 59 Periodic pattern • Find the repeat pattern for individual’s trajectory 1. Pre-defined and synchronized timestamps 2. Adaptive region: density-based cluster 3. Trajectories are translated to sequence of regions at pre-defined time instances N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. W. Cheung. Mining, indexing, and querying historical spatiotemporal data. In SIGKDD, pages 236–245, 2004. 25/04/2014 DASFAA 2014 Tutorial 60 Interesting travel sequence • A sequence of interesting locations that is travelled by experienced drivers frequently • Interestingness of a location? – How many experienced travellers visited it • Experience of a traveller? – How many interesting locations she has visited Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from gps trajectories. In WWW, pages 791–800, 2009. 25/04/2014 DASFAA 2014 Tutorial 61 Interesting travel sequence • Hyperlink-Induced Topic Search (HITS) 25/04/2014 DASFAA 2014 Tutorial 62 Popular route search • Given source/destination, find/estimate the most popular route in between • Ideally, we can just count the number of trajectories on different paths connecting the two locations Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. In ICDE, pages 900–911, 2011 25/04/2014 DASFAA 2014 Tutorial 63 Popular route discovery • In real scenarios, it’s not easy to find such well-divided groups • Even worse, there’s no trajectory connecting two locations at all! 25/04/2014 DASFAA 2014 Tutorial 64 Popular route discovery • Construct a transfer network from raw trajectories as an intermediate result to capture the moving behaviors between locations – Node: cluster of turning points – Edge: trajectories passing two nodes 25/04/2014 DASFAA 2014 Tutorial 65 Popular route discovery • Transfer probability • Find the route with the highest joint transfer probability w.r.t. destination 25/04/2014 DASFAA 2014 Tutorial 66 Find popular route from uncertain trajectories • Low-sampling-rate trajectories have high degree of uncertainty • Uncertain trajectories are prevalent! • Is it possible to recover the original route given a low-sampling-rate trajectory? • What if… – There is a historical set of uncertain trajectories Kai Zheng, Yu Zheng, Xing Xie and Xiaofang Zhou. Reducing Uncertainty of Low-Sampling-Rate Trajectories. ICDE 2012 25/04/2014 DASFAA 2014 Tutorial 67 Find popular route from uncertain trajectories • Can we find popular routes for specific s/d using low-sampling-rate trajectories Infrequent samples on the same path can reinforce each other, and they collectively form a more ‘dense’ trajectory 25/04/2014 DASFAA 2014 Tutorial 68 Find popular route from uncertain trajectories • Find PR for a given sequence of locations • Two-phase approach – Local PR construction – Global PR search 25/04/2014 DASFAA 2014 Tutorial 69 Find popular route from uncertain trajectories without road networks • A road network is not always available or applicable 1. Discretize space into disjoint cells 2. Derive the transfer graph using cells 3. Infer frequent ‘virtual edges’ on the graph L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In ACM SIGKDD, pages 195–203, 2012. 25/04/2014 DASFAA 2014 Tutorial 70 Time period-based most frequent path (TPMFP) • Find the most frequent path for specific s/d and time period • Desired properties for a MFP – Suffix optimal: suffix of a MFP is also a MFP – Length insensitive: MFP shouldn’t favor long/short route – Bottleneck free: MFP shouldn’t contain infrequent edges Wuman Luo, Haoyu Tan, Lei Chen, Lionel M. Ni. Finding Time Period-Based Most Frequent Path in Big Trajectory Data. SIGMOD 2013 25/04/2014 DASFAA 2014 Tutorial 71 Time period-based most frequent path (TPMFP) • Footmark graph – A weighted sub-graph – Edge frequency: number of trajectories reaching vd during T – Path frequency: non-decreasingly sorted sequence of edge frequencies v1 à v2: 14, v2 à v3: 10 v3 à v12: 10, v2 à v12: 8 V1 à v2 à v12: (8, 14) V1àv2àv3àv12: (10,10,14) V1àv10àv11àv12: (1,21,21) 25/04/2014 DASFAA 2014 Tutorial 72 Time period-based most frequent path (TPMFP) • Compare path frequency – More-frequent-than relation (>) – F > F’ if their first different value fj > fj’ – (10,10,14) > (8,14) > (1,21,21) • Property – A total order, so MFP always exist – Guarantee the suffix optimal – Length of path doesn’t matter a lot – Path with infrequent edge frequency be disadvantaged 25/04/2014 DASFAA 2014 Tutorial 73 PR search is more challenging • Specific source/destination (and time constraint) – Data are sparse – How to utilize more relevant data? • Online performance – Efficiency is critical 25/04/2014 DASFAA 2014 Tutorial 74 Summary • Background – What is trajectory – Where do they come from – Why are they useful – Characteristics • Trajectory similarity search – Query classification – Trajectory similarity measures – Trajectory index • Similarity-based trajectory mining – Popular route mining – Co-traveller discovery – Trajectory clustering 25/04/2014 DASFAA 2014 Tutorial 75 Thank you • Questions – kevinz@itee.uq.edu.au • DKE group at UQ – http://www.itee.uq.edu.au/dke/dke-lab • My research – http://staff.itee.uq.edu.au/kevinz/ 25/04/2014 DASFAA 2014 Tutorial 76