Analysing Big Trajectory Data Fundamental to Applications Kevin Zheng 23/07/2014 PhD School 2014 1 Outline • Background – – – – What is trajectory Where do they come from Why are they useful Characteristics • Trajectory query processing – Query classification – Trajectory distance measures – Trajectory index • Applications – Popular route mining – Co-traveller discovery – Trajectory clustering 23/07/2014 PhD School 2014 2 Outline • Background – – – – What is trajectory Where do they come from Why are they useful Characteristics • Trajectory query processing – Query classification – Trajectory distance measures – Trajectory index • Applications – Popular route mining – Co-traveller discovery – Trajectory clustering 23/07/2014 PhD School 2014 3 What is trajectory? • Historical location records of moving objects • In mathematics – Continuous function: time location – Location can be any dimension • In real applications – Locations are sampled periodically – A finite sequence of time-stamped locations: <p1, t1>, <p2, t2> …, <pn, tn> – p: two or three dimensions (longitude, latitude) 23/07/2014 PhD School 2014 4 Where is it from? 23/07/2014 PhD School 2014 5 Where is it from? • Location tracking devices – Vehicles, mobile phone users, animals • Records in cyber-space (activity trajectory) – SNS: Twitter, Facebook, Weibo – Credit card statement • Sensors (passive mode) – Surveillance cameras, WiFi 23/07/2014 PhD School 2014 6 Who cares about it? • Government – Traffic analysis and predication – Urban planning • Business – Location-based service providers – Personalized advertisement – Taxi company, logistic company • Scientists & Researchers – Zoologist, meteorologist – Research interest 23/07/2014 PhD School 2014 7 BTW: Apple also cares a lot • If you’ve got an iPhone 4S or later running iOS 7, follow these steps: – 1. Go to Settings, then Privacy > Location Services. – 2. Scroll to the bottom of the list of apps that use Location Services. Tap System Services. Yours is still on? 23/07/2014 PhD School 2014 8 Trajectory data are BIG • Volume • Velocity • Variety 23/07/2014 PhD School 2014 9 Volume • In 2010, 1 billion vehicles – Taxi, logistic companies keep tracking their vehicles – Self-driving car in near future? • In 2012, 1.08 billion smartphone users • In 2013, 20 million surveillance cameras in China • They are generator! – The data keep accumulated 23/07/2014 PhD School 2014 10 Velocity • Not just huge, they’re being generated quickly • Vehicle tracking & navigation – Re-position every few seconds • Geo-tagged social media – 2 million Flickr photos per day, 5% geo-tagged – 100 million posts on Sina Weibo per day, 1-2% geo-tagged – 400 million tweets per day, 1% geo-tagged • Sensors – How many cars pass a road camera every day? 23/07/2014 PhD School 2014 11 Geo-tagged tweets Images courtesy of Twitter 23/07/2014 PhD School 2014 12 Variety • Data format – – – – Longitude, latitude: GPS trajectory ID of POI: check-in record ID of road segment: navigation applications Bus/metro stop: travel smart card • Tracking devices – Car GPS, smartphones, sensors • Sampling rate • Data quality and uncertainty 23/07/2014 PhD School 2014 13 Research directions • • • • • Scalable, real-time data processing Flexible database storage and index Effective similarity measures Uncertainty management Data compression Key and fundamental research problem: similarity-based analysis 23/07/2014 PhD School 2014 14 Outline • Background – – – – What is trajectory Where do they come from Why are they useful Characteristics • Trajectory query processing – Query classification – Trajectory distance measures – Trajectory index • Application – Popular route mining – Co-traveller discovery – Trajectory clustering 23/07/2014 PhD School 2014 15 Trajectory query processing • Input: a trajectory dataset D, a query object Q • Output: a subset of D that are ‘relevant’ to Q • Types of Q – Point, region, trajectory • Relevance to Q – Distance function • How to search efficiently? – Trajectory index 23/07/2014 PhD School 2014 16 Query types • NN query – Query: point(s) • Range query – Query: region • Similarity query – Query: trajectory 23/07/2014 PhD School 2014 17 NN query (single point) Find a moving object that was the closest to a given query location Query location: q Temporal constraint (optional): tc = [ts, te] te ts q ts Nearest sample point Actual nearest location • It’s more costly to find nn according to actual nearest location • For a high sampled trajectory, nearest sample point is a good approximation te [Tao2002] Tao Y., Papadias D. and Shen Q., Continuous nearest neighbour search, VLDB, 2002 23/07/2014 PhD School 2014 18 NN query (multiple points) q1 q2 Query locations Q: q1, q2, q3, q4 q3 D(Q,T) is an aggregate function of D(q,T) q4 [Chen2010] Chen Z., Shen HT., Zhou X., Zheng Y and Xie X., Searching trajectories by locations – an efficiency study. SIGMOD 2010 23/07/2014 PhD School 2014 19 Range query • Find the moving objects that traveled within a given region during a given time interval • Spatial region: R • Temporal interval:[ts, te] R ts te ts te [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 23/07/2014 PhD School 2014 20 Similarity query • Find the moving objects that had similar travel history with a given trajectory How to measure their distance? Tq 23/07/2014 PhD School 2014 21 Trajectory distance measures • Compared to point-to-point, point-to-trajectory distance, trajectory-to-trajectory distance measure is more complicated • More factors to consider – Alignment of sampling points – Consider temporal dimension? – Robust to noise? 23/07/2014 PhD School 2014 22 Classification • Alignment of samples – Lock-step vs. adaptive alignment • Similarity metric – Geographical distance vs. count based • Continuity – Discrete vs. continuous • Dimension – Spatial-only vs. spatio-temporal • Underlying space – Euclidean space vs. road network 23/07/2014 PhD School 2014 23 Lock-step alignment • The k-th point of trajectory A is aligned to the k-th point of trajectory B 23/07/2014 PhD School 2014 24 Drawback • Cannot find similar trajectories with different sampling rates • Sensitive to noise 23/07/2014 PhD School 2014 25 Adaptive alignment • Dynamic Time Warping distance – Adaptation from time series distance measure – Used to handle time shift and scale in time series • Optimal order-aware alignment between two sequences – Goal: minimize the aggregate distance between matched points • 1-to-many mapping Yi, Byoung-Kee, Jagadish, HV and Faloutsos, Christos, Efficient retrieval of similar time sequences under time warping. ICDE 1998 23/07/2014 PhD School 2014 26 DTW for trajectories • Useful when detecting similar trajectories with different sampling rates 23/07/2014 PhD School 2014 27 Similarity metric • Geographical distance based – Similarity is measured by the geographical distance between samples – DTW • Count based – Similarity is measured by the number of ‘similar’/ ‘dissimilar’ samples – LCSS: count the similar sample pairs – EDR: count the dissimilar sample pairs 23/07/2014 PhD School 2014 28 LCSS • Longest Common Sub-Sequence • Adaptation of string similarity – Lcss(‘abcde’,’bd’) = 2 • Threshold-based equality relationship – Two locations are regarded as equal if they’re ‘close’ enough (compared to a threshold) • 1-to-(1 or null) mapping VLACHOS, M., GUNOPULOS, D., AND KOLLIOS, G. Discovering similar multidimensional trajectories. ICDE 2002 23/07/2014 PhD School 2014 29 LCSS • Insensitive to noise • Not easy to define threshold • May return dissimilar trajectories p5 p3 p4 p2 p’3 p1 23/07/2014 p’1 p’2 PhD School 2014 30 EDR • Edit Distance on Real sequence • Adaptation from Edit Distance on strings – Number of insert, delete, replace needed to convert A into B • Threshold-based equality relationship – Two locations are regarded as equal if they’re ‘close’ enough (compared to a threshold) Lei Chen, M. Tamer Ozsu, Vincent Oria, Robust and Fast Similarity Search for Moving Object Trajectories. SIGMOD 2005 23/07/2014 PhD School 2014 31 EDR • Value means the number of operations, not “distance between locations” – Insensitive to noise replace p5 insert p3 p4 p2 p’3 insert 23/07/2014 p1 p’1 p’2 PhD School 2014 32 LCSS and EDR • They are both count-based – LCSS counts the number of matched pairs – EDR counts the cost of operations needed to fix the unmatched pairs • Higher LCSS, lower EDR • If cost(replace) = cost(insert) + cost(delete): • EDR(X,Y) = L(X)+L(Y) – 2LCSS(X,Y) 23/07/2014 PhD School 2014 33 Continuity • Discrete measures – Only consider the sample points of trajectory – All previous measures • Continuous measures – Consider the line segments between samples – OWD – LIP 23/07/2014 PhD School 2014 34 OWD • One Way Distance from T1 to T2 is: – Integral of the distance from points of T1 to T2 – Divided by the length of T1 • Make it into symmetric measure Bin Lin, Jianwen Su, One Way Distance: For Shape Based Similarity Search of Moving Object Trajectories. In Geoinformatica (2008) 23/07/2014 PhD School 2014 35 OWD example • Consider one trajectory as piece-wise line segment, and the other as discrete samples 23/07/2014 PhD School 2014 36 LIP distance • Locality In-between Polylines – Polygon is the set of polygons formed between intersection points – Nikos Pelekis et al, Similarity Search in Trajectory Databases. Symposium on Temporal Representation and Reasoning 2007 23/07/2014 PhD School 2014 37 LIP distance • Only work for 2-dimensional trajectories • Polygon polyhedron: non-trivial change 23/07/2014 PhD School 2014 38 Dimension • Spatial only – Disregard the time information on sample points • Spatio-temporal – Take the timestamp into consideration – Synchronous Euclidean Distance 23/07/2014 PhD School 2014 39 SED • Synchronous Euclidean Distance – Euclidean distance between locations at the same time instance of two trajectories • Regard trajectory as continuous function of time Mirco Nanni, Dino Pedreschi, Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems (2006) POTAMIAS, M., PATROUMPAS, K., AND SELLIS, T. K. Sampling trajectory streams with spatiotemporal criteria. SSDBM 2006 23/07/2014 PhD School 2014 40 SED Virtually create a sample point at t=3 t=10 t=20 t=0 t=15 t=5 t=10 t=3 t=0 23/07/2014 t=25 t=12 t=7 PhD School 2014 41 Underlying space • Euclidean space • Road network – Distance is defined along shortest path Jung-Rae Hwang, Hye-Young Kang, Ki-Joune Li, Searching for Similar Trajectories on Road Networks Using Spatio-temporal Similarity. Advances in Databases and Information Systems, pp 282 – 295, 2006 23/07/2014 PhD School 2014 42 Trajectory index • For a single trajectory, it’s easy to calculate the nearest distance to a query location or check if it falls inside a given region • But what about millions or even more – Linear scan? • We need to index trajectories – Generic trajectory index for NN and range query – For similarity query, index is usually dependent on similarity measures 23/07/2014 PhD School 2014 43 Generic trajectory index • Treat spatial and temporal dimension equally – 3D R-tree – STR-tree (Spatio-temporal R-tree) – TB-tree (Trajecoty Bundle) • Index temporal dimension first – Multi-version R-tree, Historical R-tree • Index spatial dimension first – Grid based index 23/07/2014 PhD School 2014 44 3D R-tree • Indexing position samples only – Cannot answer queries about movements inbetween those samples • Indexing line segments Time y x 23/07/2014 PhD School 2014 45 Problem of R-tree • Large “dead space” – MBB covers a large portion of the space with no data – Low pruning power • Trajectory preservation – Line segments are grouped merely by spatial proximity – Regardless which trajectory they belong to – Retrieving a trajectory requires multiple visits to different paths in the tree 23/07/2014 PhD School 2014 46 Augmented 3D R-tree • Augment leaf node with orientation information • Distance to line segments can be approximated more accurately q [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 23/07/2014 PhD School 2014 47 STR-tree • Extension of augmented 3D R-tree • Insertion – Try to keep the line segments belonging to the same trajectory together – Find the leaf node containing the predecessor • Node split – Put disconnected segments into new node – Put the most recent backward-connected segment into new node [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 23/07/2014 PhD School 2014 48 TB-tree (Trajectory Bundle) • A leaf node only contains segments belonging to the same trajectory • Leaf nodes containing same trajectory are linked • Strictly preserve trajectories [Pfoster 2000] Dieter Pfoster, Christian S. Jensen, Yannis T., Novel approaches to the indexing of moving object trajectories. VLDB, 2000 23/07/2014 PhD School 2014 49 Multi-version R-tree For each timestamp, an R-tree is created. So, there are many R-trees. These R-trees are indexed. HR-tree Query for trajectories in a given region and in a given time interval: 1. The R-tree at the timestamp is found first 2. The trajectories in the specified region are retrieved from the R-tree. [Nascimento1998] Nascimento, M., Silva, J. Towards Historical R-trees. ACM SAC, 1998 [Tao2001a] Tao, Y., Papadias, D.: Efficient historical r-trees. In: ssdbm, p. 0223. Published by the IEEE Computer Society (2001) [Xu2005]Xu, X., Han, J., Lu, W.: Rt-tree: An improved r-tree indexing structure for temporal spatial databases. In: Int. Symp. on Spatial Data Handling, 2005 [Tao2001b] Tao, Y., Papadias, D.: Mv3r-tree: A spatio-temporal access method for timestamp and interval queries. In: VLDB, pp. 431-440 (2001) 23/07/2014 PhD School 2014 50 Grid-based index • Partition space into non-overlapping cells • Trajectory segments in each cell are indexed on temporal dimension • Query processing: – Spatial filtering – Temporal filtering [Prasad2003] V. Prasad Chakka Adam C. Everspaugh Jignesh M., Patel, Indexing Large Trajectory Data Sets With SETI, CIDR 2003 23/07/2014 PhD School 2014 51 Grid-based index [Prasad2003] V. Prasad Chakka Adam C. Everspaugh Jignesh M., Patel, Indexing Large Trajectory Data Sets With SETI, CIDR 2003 23/07/2014 PhD School 2014 52 Outline • Background – – – – What is trajectory Where do they come from Why are they useful Characteristics • Trajectory query processing – Query classification – Trajectory distance measures – Trajectory index • Application – Popular route mining – Co-traveller discovery – Trajectory clustering 23/07/2014 PhD School 2014 53 Popular route mining • Shortest/fastest path may not be the most ideal – In practice drivers may consider many factors, which are hard to model – Data-driven approach: find the most popular/desirable path using the trajectories of past travellers • Methodologies – – – – 23/07/2014 Frequent sequential pattern Statistical approach Graph based approach Rank based approach PhD School 2014 54 Frequent sequential pattern • Output: a set of individual trajectories that share the property of visiting the same sequence of places with similar travel times 1. Spatial discretization: discretize space into finite set of regions of interest (RoI) 2. Translate trajectories into sequence of RoIs 3. Adapt sequential pattern mining algorithms with time constrain F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In SIGKDD, pages 330–339, 2007 23/07/2014 PhD School 2014 55 Periodic pattern • Find the repeat pattern for individual’s trajectory 1. Pre-defined and synchronized timestamps 2. Adaptive region: density-based PoI cluster 3. Trajectories are translated to sequence of regions at pre-defined time instances N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. W. Cheung. Mining, indexing, and querying historical spatiotemporal data. In SIGKDD, pages 236–245, 2004. 23/07/2014 PhD School 2014 56 Popular route search • Given source/destination, find/estimate the most popular route in between • Ideally, we can just count the number of trajectories on different paths connecting the two locations Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. In ICDE, pages 900–911, 2011 23/07/2014 PhD School 2014 57 Popular route discovery • In real scenarios, it’s not easy to find such well-divided groups • Even worse, there’s no trajectory connecting source/destination at all! 23/07/2014 PhD School 2014 58 Statistical approach • Construct a transfer network from raw trajectories as an intermediate result to capture the moving behaviors between locations – Node: major road intersections – Edge: trajectories connecting two nodes 23/07/2014 PhD School 2014 59 Statistical approach • Transfer probability • Find the route with the highest joint transfer probability w.r.t. destination 23/07/2014 PhD School 2014 60 Graph based approach • Statistical approach relies on accurate estimation of transfer probability – Only high sampled trajectories can derive this info V1 V2 Pr[v1 v2] =? V3 Kai Zheng, Yu Zheng, Xing Xie and Xiaofang Zhou. Reducing Uncertainty of Low-Sampling-Rate Trajectories. ICDE 2012 23/07/2014 PhD School 2014 61 Graph based approach • In many trajectory databases, low sampled trajectories constitute the majority – Beijing taxi trajectories, over 90% has >2min time gap between consecutive samples • These trajectories, though low sampled, also contain valuable info – How can we take advantage of them? 23/07/2014 PhD School 2014 62 Find popular route from uncertain trajectories • Can we find popular routes for specific s/d using low-sampling-rate trajectories Infrequent samples on the same path can reinforce each other, and they collectively form a more ‘dense’ trajectory 23/07/2014 PhD School 2014 63 Find popular route from uncertain trajectories • Construct a traversal graph using the low sampled trajectories in surrounding area of query locations • Conduct inference based on traversal graph Kai Zheng, Yu Zheng, Xing Xie and Xiaofang Zhou. Reducing Uncertainty of Low-Sampling-Rate Trajectories. In ICDE 2012 23/07/2014 PhD School 2014 64 Find popular route from uncertain trajectories without road networks 1. Discretize space into disjoint cells 2. Derive the transfer graph using cells 3. Infer frequent ‘virtual edges’ on the graph L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In ACM SIGKDD, pages 195–203, 2012. 23/07/2014 PhD School 2014 65 Rank based approach • Desired properties for a MFP – Suffix optimal: suffix of a MFP is also a MFP – Length insensitive: MFP shouldn’t favor long/short route – Bottleneck free: MFP shouldn’t contain infrequent edges Wuman Luo, Haoyu Tan, Lei Chen, Lionel M. Ni. Finding Time Period-Based Most Frequent Path in Big Trajectory Data. SIGMOD 2013 23/07/2014 PhD School 2014 66 Rank based approach • Construct a traversal graph – Edge frequency: number of trajectories directly connecting vs vd during T – Path frequency: non-decreasingly sorted sequence of edge frequencies • Rank based path frequency – How to compare two PFs? According to their first different value – (10,10,14) > (8,14) > (1,21,21) EF v1 v3 v2: 14, v2 v3: 10 v12: 10, v2 v12: 8 PF V1 v2 v12: (8, 14) V1 v2 v3 v12: (10,10,14) V1 v10 v11 v12: (1,21,21) Wuman Luo, Haoyu Tan, Lei Chen, Lionel M. Ni. Finding Time Period-Based Most Frequent Path in Big Trajectory Data. SIGMOD 2013 23/07/2014 PhD School 2014 67 Time period-based most frequent path (TPMFP) • Compare path frequency – More-frequent-than relation (>) – F > F’ if their first different value fj > fj’ – (10,10,14) > (8,14) > (1,21,21) • Property – – – – 23/07/2014 A total order, so MFP always exist Guarantee the suffix optimal Length of path doesn’t matter a lot Path with infrequent edge frequency be disadvantaged PhD School 2014 68 Summary • Background – – – – What is trajectory Where do they come from Why are they useful Characteristics • Trajectory query processing – Query classification – Trajectory distance measures – Trajectory index • Application – Popular route mining – Co-traveller discovery – Trajectory clustering 23/07/2014 PhD School 2014 69 Thank you • Questions – kevinz@itee.uq.edu.au • DKE group at UQ – http://www.itee.uq.edu.au/dke/dke-lab • My research – http://staff.itee.uq.edu.au/kevinz/ 23/07/2014 PhD School 2014 70