Analysing Big Trajectory Data

Analysing Big Trajectory Data
Fundamental to Applications
Kevin Zheng
23/07/2014
PhD School 2014
1
Outline
• Background
– What is trajectory
– Where do they come from
– Why are they useful
– Characteristics
• Trajectory query processing
– Query classification
– Trajectory distance measures
– Trajectory index
• Applications
– Popular route mining
– Co-traveller discovery
– Trajectory clustering
What is trajectory?
• Historical location records of moving objects
• In mathematics
– Continuous function: time → location
– Location can be any dimension
• In real applications
– Locations are sampled periodically
– A finite sequence of time-stamped locations: <p1, t1>, <p2, t2>, …, <pn, tn>
– p: two or three dimensions (longitude, latitude)
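The definition above can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the names `Sample`, `is_valid` and `position_at`, the choice of seconds as time unit, and the use of linear interpolation to approximate the continuous function between samples are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """One time-stamped location <p, t> (field names are illustrative)."""
    lon: float
    lat: float
    t: float  # timestamp, assumed to be in seconds


Trajectory = List[Sample]


def is_valid(traj: Trajectory) -> bool:
    """Check that the timestamps are strictly increasing."""
    return all(a.t < b.t for a, b in zip(traj, traj[1:]))


def position_at(traj: Trajectory, t: float) -> Tuple[float, float]:
    """Approximate the function time -> location by linear interpolation."""
    if not traj or t < traj[0].t or t > traj[-1].t:
        raise ValueError("t is outside the time span of the trajectory")
    for a, b in zip(traj, traj[1:]):
        if a.t <= t <= b.t:
            r = (t - a.t) / (b.t - a.t)
            return (a.lon + r * (b.lon - a.lon), a.lat + r * (b.lat - a.lat))
    return (traj[-1].lon, traj[-1].lat)  # trajectory with a single sample
```

Storing the samples as an ordered list keeps the sampled form explicit, while `position_at` recovers an estimate of the underlying continuous movement when one is needed.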
Where is it from?
• Location tracking devices
– Vehicles, mobile phone users, animals
• Records in cyber-space (activity trajectory)
– SNS: Twitter, Facebook, Weibo
– Credit card statement
• Sensors (passive mode)
– Surveillance cameras, WiFi
Who cares about it?
• Government
– Traffic analysis and prediction
– Urban planning
• Business
– Location-based service providers
– Personalized advertisement
– Taxi companies, logistics companies
• Scientists & Researchers
– Zoologist, meteorologist
– Research interest
BTW: Apple also cares a lot
• If you’ve got an iPhone 4S or later running
iOS 7, follow these steps:
– 1. Go to Settings, then Privacy > Location
Services.
– 2. Scroll to the bottom of the list of apps that use
Location Services. Tap System Services.
Yours is still on?
Trajectory data are BIG
• Volume
• Velocity
• Variety
Volume
• In 2010, 1 billion vehicles
– Taxi and logistics companies keep tracking their vehicles
– Self-driving cars in the near future?
• In 2012, 1.08 billion smartphone users
• In 2013, 20 million surveillance cameras in
China
• They are all data generators!
– The data keep accumulating
Velocity
• Not just huge, they’re being generated quickly
• Vehicle tracking & navigation
– Re-position every few seconds
• Geo-tagged social media
– 2 million Flickr photos per day, 5% geo-tagged
– 100 million posts on Sina Weibo per day, 1-2%
geo-tagged
– 400 million tweets per day, 1% geo-tagged
• Sensors
– How many cars pass a road camera every day?
Geo-tagged tweets
Images courtesy of Twitter
Variety
• Data format
– Longitude, latitude: GPS trajectory
– ID of POI: check-in record
– ID of road segment: navigation applications
– Bus/metro stop: travel smart card
• Tracking devices
– Car GPS, smartphones, sensors
• Sampling rate
• Data quality and uncertainty
Research directions
• Scalable, real-time data processing
• Flexible database storage and index
• Effective similarity measures
• Uncertainty management
• Data compression
Key and fundamental research problem: similarity-based analysis
Trajectory query processing
• Input: a trajectory dataset D, a query object Q
• Output: a subset of D that are ‘relevant’ to Q
• Types of Q
– Point, region, trajectory
• Relevance to Q
– Distance function
• How to search efficiently?
– Trajectory index
Query types
• NN query
– Query: point(s)
• Range query
– Query: region
• Similarity query
– Query: trajectory
NN query (single point)
Find a moving object that was the closest to a given query location
Query location: q
Temporal constraint (optional): tc = [ts, te]
[Figure: a trajectory over [ts, te] with query location q, showing the nearest sample point and the actual nearest location]
• It is more costly to find the nearest neighbour according to the actual nearest location
• For a densely sampled trajectory, the nearest sample point is a good approximation
[Tao2002] Tao Y., Papadias D. and Shen Q., Continuous nearest neighbour search, VLDB, 2002
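The difference between the nearest sample point and the actual nearest location can be illustrated with a short sketch. It assumes planar (x, y, t) tuples, Euclidean distance, and a brute-force scan; the function names are hypothetical, and the temporal constraint is applied to sample points only.

```python
import math


def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_segment_dist(q, a, b):
    """Distance from q to the closest point on the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return dist(q, a)
    r = ((q[0] - a[0]) * dx + (q[1] - a[1]) * dy) / seg_len2
    r = max(0.0, min(1.0, r))  # clamp the projection onto the segment
    return dist(q, (a[0] + r * dx, a[1] + r * dy))


def nearest_sample_dist(q, traj, tc=None):
    """Distance from q to the nearest sample; tc = (ts, te) is optional."""
    pts = [(x, y) for x, y, t in traj if tc is None or tc[0] <= t <= tc[1]]
    return min((dist(q, p) for p in pts), default=math.inf)


def actual_nearest_dist(q, traj):
    """Distance from q to the nearest location on the interpolated path."""
    pts = [(x, y) for x, y, _ in traj]
    if len(pts) == 1:
        return dist(q, pts[0])
    return min(point_segment_dist(q, a, b) for a, b in zip(pts, pts[1:]))


def nn_query(q, dataset, tc=None):
    """Return the id of the trajectory whose nearest sample is closest to q."""
    return min(dataset, key=lambda tid: nearest_sample_dist(q, dataset[tid], tc))
```

For a trajectory with only two far-apart samples the two distances can differ a lot, which is why the approximation is only safe when the sampling rate is high.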
NN query (multiple points)
Query locations Q: q1, q2, q3, q4
D(Q,T) is an aggregate function of D(q,T)
[Chen2010] Chen Z., Shen HT., Zhou X., Zheng Y and Xie X., Searching trajectories by locations – an efficiency study. SIGMOD 2010
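A minimal sketch of the aggregate idea follows. The slide does not say which aggregate function is used, so the default `sum` here is an assumption, as are the function names and the use of the nearest sample point for D(q,T).

```python
import math


def point_to_trajectory(q, traj):
    """D(q, T): distance from q to the nearest sample point of T."""
    return min(math.hypot(q[0] - x, q[1] - y) for x, y in traj)


def multi_point_distance(queries, traj, aggregate=sum):
    """D(Q, T): aggregate of D(q, T) over all query locations q in Q."""
    return aggregate(point_to_trajectory(q, traj) for q in queries)


def k_best(queries, dataset, k, aggregate=sum):
    """Return the ids of the k trajectories with the smallest D(Q, T)."""
    ranked = sorted(
        dataset,
        key=lambda tid: multi_point_distance(queries, dataset[tid], aggregate),
    )
    return ranked[:k]
```

Passing `aggregate=max` instead would rank trajectories by their worst-served query location rather than by the total.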
Range query
• Find the moving objects that traveled within a given region during a given
time interval
• Spatial region: R
• Temporal interval:[ts, te]
[Figure: spatial region R and trajectories over the interval [ts, te]]
[Pfoser 2000] Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis, Novel approaches to the indexing of moving object trajectories. VLDB, 2000
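As a baseline, a range query can be answered by a linear scan. The sketch below assumes an axis-aligned rectangular region given as (xmin, ymin, xmax, ymax) and tests sample points only, so a trajectory that crosses R strictly between two samples would be missed; names are illustrative.

```python
def range_query(dataset, region, ts, te):
    """Return ids of trajectories with a sample inside region during [ts, te].

    dataset maps a trajectory id to a list of (x, y, t) samples.
    """
    xmin, ymin, xmax, ymax = region
    result = []
    for tid, traj in dataset.items():
        if any(
            ts <= t <= te and xmin <= x <= xmax and ymin <= y <= ymax
            for x, y, t in traj
        ):
            result.append(tid)
    return result
```

The cost grows with the total number of samples in the dataset, which is the motivation for the trajectory indexes discussed later in the talk.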
Similarity query
• Find the moving objects that had similar travel
history with a given trajectory
• How do we measure the distance between two trajectories?
[Figure: query trajectory Tq among candidate trajectories]
Trajectory distance measures
• Compared with point-to-point and point-to-trajectory distances, a trajectory-to-trajectory distance measure is more complicated
• More factors to consider
– Alignment of sampling points
– Consider temporal dimension?
– Robust to noise?
Classification
• Alignment of samples
– Lock-step vs. adaptive alignment
• Similarity metric
– Geographical distance vs. count based
• Continuity
– Discrete vs. continuous
• Dimension
– Spatial-only vs. spatio-temporal
• Underlying space
– Euclidean space vs. road network
Lock-step alignment
• The k-th point of trajectory A is aligned to the
k-th point of trajectory B
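A lock-step measure can be sketched in a few lines. The slide does not specify how the aligned point distances are combined, so summing Euclidean distances is an assumption here, and `lock_step_distance` is a hypothetical name.

```python
import math


def lock_step_distance(a, b):
    """Sum of distances between the k-th point of a and the k-th point of b."""
    if len(a) != len(b):
        raise ValueError("lock-step alignment needs equally long trajectories")
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b))
```

The length check makes the main limitation visible: two trajectories of the same route sampled at different rates cannot be compared at all without resampling.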
Drawbacks of lock-step alignment
• Cannot find similar trajectories with different
sampling rates
• Sensitive to noise
Adaptive alignment
• Dynamic Time Warping distance
– Adaptation from time series distance measure
– Used to handle time shift and scale in time series
• Optimal order-aware alignment between two
sequences
– Goal: minimize the aggregate distance between
matched points
• 1-to-many mapping
Yi, Byoung-Kee, Jagadish, HV and Faloutsos, Christos, Efficient retrieval of similar time sequences under time warping. ICDE 1998
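The optimal order-aware alignment is usually computed by dynamic programming. The following is a sketch of the textbook DTW recurrence applied to 2D points, assuming Euclidean point distance and no warping-window constraint; it is not taken from the cited paper.

```python
import math


def dtw(a, b):
    """Dynamic Time Warping distance between two point sequences."""
    n, m = len(a), len(b)
    inf = math.inf
    # D[i][j] = minimal aggregate distance aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            # a point may be matched to several points of the other sequence
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because one point may be matched to many, a trajectory and an over-sampled copy of it can still reach distance zero, which lock-step alignment cannot express. The table costs O(n·m) time and space.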
DTW for trajectories
• Useful when detecting similar trajectories with
different sampling rates
Similarity metric
• Geographical distance based
– Similarity is measured by the geographical
distance between samples
– DTW
• Count based
– Similarity is measured by the number of ‘similar’/
‘dissimilar’ samples
– LCSS: count the similar sample pairs
– EDR: count the dissimilar sample pairs
LCSS
• Longest Common Sub-Sequence
• Adaptation of string similarity
– LCSS('abcde', 'bd') = 2
• Threshold-based equality relationship
– Two locations are regarded as equal if they’re
‘close’ enough (compared to a threshold)
• 1-to-(1 or null) mapping
Vlachos, M., Gunopulos, D., and Kollios, G. Discovering similar multidimensional trajectories. ICDE 2002
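The threshold-based adaptation can be sketched as the standard LCS dynamic program with "equal" replaced by "within eps". Using Euclidean distance for the closeness test and omitting any temporal window are assumptions of this sketch, not details given on the slide.

```python
import math


def lcss(a, b, eps):
    """Length of the longest common subsequence of two point sequences.

    Two points count as equal when their distance is at most eps.
    """
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = math.hypot(a[i - 1][0] - b[j - 1][0],
                               a[i - 1][1] - b[j - 1][1]) <= eps
            if close:
                L[i][j] = L[i - 1][j - 1] + 1  # match the pair
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # skip one point
    return L[n][m]
```

Unmatched points are simply skipped, so a single outlier does not change the score. The value is a count and is often normalised by the shorter trajectory length to obtain a similarity in [0, 1].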
LCSS
• Insensitive to noise
• Not easy to define threshold
• May return dissimilar trajectories
[Figure: trajectory p1, …, p5 compared with trajectory p'1, p'2, p'3]
EDR
• Edit Distance on Real sequence
• Adaptation from Edit Distance on strings
– Number of insert, delete, and replace operations needed to convert A into B
• Threshold-based equality relationship
– Two locations are regarded as equal if they’re
‘close’ enough (compared to a threshold)
Lei Chen, M. Tamer Ozsu, Vincent Oria, Robust and Fast Similarity Search for Moving Object Trajectories. SIGMOD 2005
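EDR follows the same dynamic-programming pattern as string edit distance. In this sketch every operation costs 1 and two points match when both coordinate differences are within eps; that matching rule and the function name are assumptions made for illustration.

```python
def edr(a, b, eps):
    """Edit Distance on Real sequence between two 2D point sequences."""
    n, m = len(a), len(b)
    # D[i][j] = number of edit operations to convert a[:i] into b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete all points of a[:i]
    for j in range(1, m + 1):
        D[0][j] = j  # insert all points of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (abs(a[i - 1][0] - b[j - 1][0]) <= eps
                     and abs(a[i - 1][1] - b[j - 1][1]) <= eps)
            D[i][j] = min(
                D[i - 1][j - 1] + (0 if match else 1),  # keep or replace
                D[i - 1][j] + 1,                        # delete
                D[i][j - 1] + 1,                        # insert
            )
    return D[n][m]
```

A far-away outlier costs exactly one operation regardless of how far away it is, which is what makes the measure robust to noise.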
EDR
• Value means the number of operations, not
“distance between locations”
– Insensitive to noise
[Figure: edit operations (one replace, two inserts) between trajectory p1, …, p5 and trajectory p'1, p'2, p'3]
LCSS and EDR
• They are both count-based
– LCSS counts the number of matched pairs
– EDR counts the cost of operations needed to fix
the unmatched pairs
• Higher LCSS, lower EDR
• If cost(replace) = cost(insert) + cost(delete):
• EDR(X,Y) = L(X)+L(Y) – 2LCSS(X,Y)
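The identity can be checked on plain strings, where L(X) is read as the length of X. The sketch below uses unit insert and delete costs and a replace cost of 2, so that cost(replace) = cost(insert) + cost(delete); the helper names are illustrative.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


def edit_distance(x, y, insert=1, delete=1, replace=2):
    """Weighted edit distance; by default replace = insert + delete."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * delete
    for j in range(1, m + 1):
        D[0][j] = j * insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else replace
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + delete,
                          D[i][j - 1] + insert)
    return D[n][m]


for x, y in [("abcde", "bd"), ("kitten", "sitting"), ("", "abc")]:
    assert edit_distance(x, y) == len(x) + len(y) - 2 * lcs_length(x, y)
```

Intuitively, every element outside the common subsequence has to be deleted from X or inserted from Y, and a replace is never cheaper than doing both.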
Continuity
• Discrete measures
– Only consider the sample points of trajectory
– All previous measures
• Continuous measures
– Consider the line segments between samples
– OWD
– LIP
OWD
• One Way Distance from T1 to T2 is:
– Integral of the distance from points of T1 to T2
– Divided by the length of T1
• Combine the two directions to make it a symmetric measure
Bin Lin, Jianwen Su, One Way Distance: For Shape Based Similarity Search of Moving Object Trajectories. In Geoinformatica (2008)
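The integral can be approximated numerically. The sketch below treats both trajectories as 2D polylines, samples T1 at the midpoints of short pieces, and averages the two one-way distances to obtain a symmetric value; the step size, the averaging, and all function names are assumptions of this illustration.

```python
import math


def _point_segment_dist(q, a, b):
    """Distance from q to the closest point on the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(q[0] - a[0], q[1] - a[1])
    r = ((q[0] - a[0]) * dx + (q[1] - a[1]) * dy) / seg_len2
    r = max(0.0, min(1.0, r))
    return math.hypot(q[0] - (a[0] + r * dx), q[1] - (a[1] + r * dy))


def _point_polyline_dist(q, poly):
    if len(poly) == 1:
        return math.hypot(q[0] - poly[0][0], q[1] - poly[0][1])
    return min(_point_segment_dist(q, a, b) for a, b in zip(poly, poly[1:]))


def owd(t1, t2, step=0.01):
    """Approximate (1 / |T1|) * integral over p in T1 of dist(p, T2)."""
    acc, total_len = 0.0, 0.0
    for a, b in zip(t1, t1[1:]):
        seg_len = math.hypot(b[0] - a[0], b[1] - a[1])
        if seg_len == 0.0:
            continue
        k = max(1, math.ceil(seg_len / step))  # number of pieces
        for i in range(k):
            r = (i + 0.5) / k  # midpoint of the i-th piece
            p = (a[0] + r * (b[0] - a[0]), a[1] + r * (b[1] - a[1]))
            acc += _point_polyline_dist(p, t2) * (seg_len / k)
        total_len += seg_len
    if total_len == 0.0:  # degenerate T1: fall back to a single point
        return _point_polyline_dist(t1[0], t2)
    return acc / total_len


def symmetric_owd(t1, t2, step=0.01):
    """Average of the two one-way distances."""
    return 0.5 * (owd(t1, t2, step) + owd(t2, t1, step))
```

The one-way distance is not symmetric: a short trajectory lying on a long one has distance zero to it, but not the other way round.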
OWD example
• Consider one trajectory as piece-wise line
segment, and the other as discrete samples
LIP distance
• Locality In-between Polylines
– Defined over the set of polygons formed between the intersection points of the two polylines
Nikos Pelekis et al, Similarity Search in Trajectory Databases. Symposium on Temporal Representation and Reasoning 2007
LIP distance
• Only works for 2-dimensional trajectories
• Polygon → polyhedron: non-trivial change
Dimension
• Spatial only
– Disregard the time information on sample points
• Spatio-temporal
– Take the timestamp into consideration
– Synchronous Euclidean Distance
SED
• Synchronous Euclidean Distance
– Euclidean distance between locations at the same
time instance of two trajectories
• Regard trajectory as continuous function of
time
Mirco Nanni, Dino Pedreschi, Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems (2006)
Potamias, M., Patroumpas, K., and Sellis, T. K. Sampling trajectory streams with spatiotemporal criteria. SSDBM 2006
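Treating a trajectory as a continuous function of time means that positions at unsampled instants have to be estimated. The sketch below assumes linear interpolation between consecutive (x, y, t) samples; averaging over the union of sample timestamps in `mean_sed` is one possible way to aggregate and is an assumption, as are the function names.

```python
import math


def position_at(traj, t):
    """Linearly interpolated (x, y) position of traj at time t."""
    if t < traj[0][2] or t > traj[-1][2]:
        raise ValueError("t is outside the time span of the trajectory")
    for (x1, y1, t1), (x2, y2, t2) in zip(traj, traj[1:]):
        if t1 <= t <= t2:
            r = (t - t1) / (t2 - t1)
            return (x1 + r * (x2 - x1), y1 + r * (y2 - y1))
    return (traj[-1][0], traj[-1][1])  # trajectory with a single sample


def sed_at(a, b, t):
    """Synchronous Euclidean distance of a and b at time instant t."""
    (xa, ya), (xb, yb) = position_at(a, t), position_at(b, t)
    return math.hypot(xa - xb, ya - yb)


def mean_sed(a, b):
    """Average SED over all sample timestamps in the common time span."""
    start, end = max(a[0][2], b[0][2]), min(a[-1][2], b[-1][2])
    times = sorted({t for _, _, t in a + b if start <= t <= end})
    if not times:
        raise ValueError("the trajectories do not overlap in time")
    return sum(sed_at(a, b, t) for t in times) / len(times)
```

When one trajectory has a sample at an instant where the other has none, `position_at` supplies the missing counterpart.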
SED
Virtually create a sample point at t=3
[Figure: two trajectories sampled at different timestamps, with a virtual sample point created at t=3]
Underlying space
• Euclidean space
• Road network
– Distance is defined along the shortest path
Jung-Rae Hwang, Hye-Young Kang, Ki-Joune Li, Searching for Similar Trajectories on Road Networks Using Spatio-temporal Similarity. Advances in Databases and Information Systems, pp 282–295, 2006
Trajectory index
• For a single trajectory, it’s easy to calculate the
nearest distance to a query location or check if
it falls inside a given region
• But what about millions of trajectories, or even more?
– Linear scan?
• We need to index trajectories
– Generic trajectory index for NN and range query
– For similarity query, index is usually dependent on
similarity measures
Generic trajectory index
• Treat spatial and temporal dimension equally
– 3D R-tree
– STR-tree (Spatio-temporal R-tree)
– TB-tree (Trajectory Bundle)
• Index temporal dimension first
– Multi-version R-tree, Historical R-tree
• Index spatial dimension first
– Grid based index
3D R-tree
• Indexing position samples only
– Cannot answer queries about movements in between those samples
• Indexing line segments
[Figure: trajectory line segments in (x, y, time) space]
Problem of R-tree
• Large “dead space”
– MBB covers a large portion of the space with no data
– Low pruning power
• Trajectory preservation
– Line segments are grouped merely by spatial
proximity
– Regardless of which trajectory they belong to
– Retrieving a trajectory requires multiple visits to
different paths in the tree
Augmented 3D R-tree
• Augment leaf node with orientation
information
• Distance to line segments can be approximated
more accurately
[Pfoser 2000] Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis, Novel approaches to the indexing of moving object trajectories. VLDB, 2000
STR-tree
• Extension of augmented 3D R-tree
• Insertion
– Try to keep the line segments belonging to the
same trajectory together
– Find the leaf node containing the predecessor
• Node split
– Put disconnected segments into new node
– Put the most recent backward-connected segment
into new node
[Pfoser 2000] Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis, Novel approaches to the indexing of moving object trajectories. VLDB, 2000
TB-tree (Trajectory Bundle)
• A leaf node only contains segments belonging to the same
trajectory
• Leaf nodes containing same trajectory are linked
• Strictly preserve trajectories
[Pfoser 2000] Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis, Novel approaches to the indexing of moving object trajectories. VLDB, 2000
Multi-version R-tree
• For each timestamp, an R-tree is created, so there are many R-trees
• These R-trees are themselves indexed (HR-tree)
• Query for trajectories in a given region and in a given time interval:
1. The R-tree at the timestamp is found first
2. The trajectories in the specified region are retrieved from the R-tree
[Nascimento1998] Nascimento, M., Silva, J. Towards Historical R-trees. ACM SAC, 1998
[Tao2001a] Tao, Y., Papadias, D.: Efficient historical R-trees. In: SSDBM, p. 223. IEEE Computer Society (2001)
[Xu2005] Xu, X., Han, J., Lu, W.: RT-tree: An improved R-tree indexing structure for temporal spatial databases. In: Int. Symp. on Spatial Data Handling, 2005
[Tao2001b] Tao, Y., Papadias, D.: MV3R-tree: A spatio-temporal access method for timestamp and interval queries. In: VLDB, pp. 431-440 (2001)
Grid-based index
• Partition space into non-overlapping cells
• Trajectory segments in each cell are indexed
on temporal dimension
• Query processing:
– Spatial filtering
– Temporal filtering
[Prasad2003] V. Prasad Chakka, Adam C. Everspaugh, Jignesh M. Patel, Indexing Large Trajectory Data Sets With SETI, CIDR 2003
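The two filtering steps can be sketched with a dictionary of cells, each holding its entries sorted by time. This toy `GridIndex` class is hypothetical and indexes individual sample points rather than trajectory segments, so it illustrates the spatial-then-temporal filtering idea only and is not the SETI structure itself.

```python
import bisect
from collections import defaultdict


class GridIndex:
    """Uniform grid; each cell keeps (t, traj_id, x, y) entries sorted by t."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, traj_id, x, y, t):
        bisect.insort(self.cells[self._cell(x, y)], (t, traj_id, x, y))

    def range_query(self, region, ts, te):
        """Ids of trajectories with a sample in region during [ts, te]."""
        xmin, ymin, xmax, ymax = region
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        result = set()
        # Spatial filtering: visit only the cells overlapping the region.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                entries = self.cells.get((cx, cy), [])
                # Temporal filtering: jump to the first entry with t >= ts.
                start = bisect.bisect_left(entries, (ts,))
                for t, tid, x, y in entries[start:]:
                    if t > te:
                        break
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        result.add(tid)
        return result
```

Cells that do not overlap the query region are never touched, and within a cell the sorted order lets the scan stop as soon as the time interval is exceeded.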
Popular route mining
• The shortest/fastest path may not be the ideal one
– In practice drivers may consider many factors, which
are hard to model
– Data-driven approach: find the most popular/desirable
path using the trajectories of past travellers
• Methodologies
– Frequent sequential pattern
– Statistical approach
– Graph based approach
– Rank based approach
Frequent sequential pattern
• Output: a set of individual trajectories that
share the property of visiting the same
sequence of places with similar travel times
1. Spatial discretization: discretize space into finite set of regions of interest (RoI)
2. Translate trajectories into sequence of RoIs
3. Adapt sequential pattern mining algorithms with time constraints
F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In SIGKDD, pages 330–339, 2007
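Steps 1 and 2 can be sketched with a uniform grid standing in for the regions of interest. This is a simplification made for illustration: the cited work derives its regions differently, the time constraint of step 3 is left out, and only contiguous RoI sequences of a fixed length are counted. All names are hypothetical.

```python
from collections import Counter


def to_roi_sequence(traj, cell_size):
    """Translate (x, y, t) samples into a sequence of grid-cell RoIs."""
    seq = []
    for x, y, _t in traj:
        roi = (int(x // cell_size), int(y // cell_size))
        if not seq or seq[-1] != roi:  # collapse consecutive repeats
            seq.append(roi)
    return seq


def frequent_subsequences(trajectories, cell_size, length, min_support):
    """Contiguous RoI sequences visited by at least min_support trajectories."""
    support = Counter()
    for traj in trajectories:
        seq = to_roi_sequence(traj, cell_size)
        grams = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(grams)  # each trajectory counts once per pattern
    return {g: c for g, c in support.items() if c >= min_support}
```

A full sequential pattern miner would also allow gaps between visited regions and would annotate each transition with its typical travel time.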
Periodic pattern
• Find the repeating patterns in an individual’s trajectory
1. Pre-defined and synchronized timestamps
2. Adaptive region: density-based PoI cluster
3. Trajectories are translated to sequence of regions
at pre-defined time instances
N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. W. Cheung. Mining, indexing, and querying historical spatiotemporal data. In SIGKDD, pages 236–245, 2004.
Popular route search
• Given source/destination, find/estimate the
most popular route in between
• Ideally, we can just count the number of
trajectories on different paths connecting the
two locations
Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. In ICDE, pages 900–911, 2011
Popular route discovery
• In real scenarios, it’s not easy to find such
well-divided groups
• Even worse, there’s no trajectory connecting
source/destination at all!
Statistical approach
• Construct a transfer network from raw trajectories
as an intermediate result to capture the moving
behaviors between locations
– Node: major road intersections
– Edge: trajectories connecting two nodes
Statistical approach
• Transfer probability
• Find the route with the highest joint transfer
probability w.r.t. destination
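The slide does not give the formula for the transfer probability, so the sketch below substitutes a much simpler model: first-order transition frequencies between nodes, estimated from trajectories that have already been mapped to node sequences. The route with the highest joint probability is then found with Dijkstra's algorithm on -log(probability). It ignores the dependence on the destination and should be read as an illustration of the idea only; all names are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def transition_probabilities(trajectories):
    """Estimate Pr[u -> v] from how often v directly follows u."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for u, v in zip(traj, traj[1:]):
            counts[u][v] += 1
    return {
        u: {v: c / sum(nbrs.values()) for v, c in nbrs.items()}
        for u, nbrs in counts.items()
    }


def most_probable_route(probs, source, dest):
    """Route maximising the product of transition probabilities."""
    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        cost, u, path = heapq.heappop(heap)
        if u == dest:
            return path, math.exp(-cost)
        if cost > best.get(u, math.inf):
            continue  # stale heap entry
        for v, p in probs.get(u, {}).items():
            new_cost = cost - math.log(p)  # product becomes a sum
            if new_cost < best.get(v, math.inf):
                best[v] = new_cost
                heapq.heappush(heap, (new_cost, v, path + [v]))
    return None, 0.0
```

Taking negative logarithms turns the maximisation of a product into a shortest-path problem with non-negative edge weights, which is why Dijkstra's algorithm applies.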
Graph based approach
• Statistical approach relies on accurate
estimation of transfer probability
– Only densely sampled trajectories can provide this information
[Figure: nodes v1, v2, v3, with Pr[v1 → v2] = ?]
Kai Zheng, Yu Zheng, Xing Xie and Xiaofang Zhou. Reducing Uncertainty of Low-Sampling-Rate Trajectories. ICDE 2012
Graph based approach
• In many trajectory databases, low-sampling-rate trajectories constitute the majority
– Beijing taxi trajectories: over 90% have a time gap of more than 2 minutes between consecutive samples
• These trajectories, though sparsely sampled, also contain valuable information
– How can we take advantage of them?
Find popular route from uncertain
trajectories
• Can we find popular routes for a specific source/destination using low-sampling-rate trajectories?
Infrequent samples on the same path can reinforce each other, and
they collectively form a more ‘dense’ trajectory
Find popular route from uncertain
trajectories
• Construct a traversal graph using the low-sampling-rate trajectories in the area surrounding the query locations
• Conduct inference based on traversal graph
Kai Zheng, Yu Zheng, Xing Xie and Xiaofang Zhou. Reducing Uncertainty of Low-Sampling-Rate Trajectories. In ICDE 2012
Find popular route from uncertain
trajectories without road networks
1. Discretize space into disjoint cells
2. Derive the transfer graph using cells
3. Infer frequent ‘virtual edges’ on the graph
L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In ACM SIGKDD, pages 195–203, 2012.
Rank based approach
• Desired properties of a most frequent path (MFP)
– Suffix optimal: a suffix of an MFP is also an MFP
– Length insensitive: an MFP shouldn’t favor long or short routes
– Bottleneck free: an MFP shouldn’t contain infrequent edges
Wuman Luo, Haoyu Tan, Lei Chen, Lionel M. Ni. Finding Time Period-Based Most Frequent Path in Big Trajectory Data. SIGMOD 2013
Rank based approach
• Construct a traversal graph
– Edge frequency: number of trajectories directly connecting vs vd during T
– Path frequency: non-decreasingly sorted sequence of edge frequencies
• Rank based path frequency
– How to compare two PFs? According to their first different value
– (10,10,14) > (8,14) > (1,21,21)
EF: v1 → v2: 14, v2 → v3: 10, v3 → v12: 10, v2 → v12: 8
PF:
v1 → v2 → v12: (8, 14)
v1 → v2 → v3 → v12: (10, 10, 14)
v1 → v10 → v11 → v12: (1, 21, 21)
Wuman Luo, Haoyu Tan, Lei Chen, Lionel M. Ni. Finding Time Period-Based Most Frequent Path in Big Trajectory Data. SIGMOD 2013
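The rank-based comparison maps naturally onto tuple comparison. The sketch below builds the path frequency as a sorted tuple of edge frequencies and compares tuples by their first differing value, using the edge frequencies listed above. How ties are broken when one sequence is a prefix of the other is not stated on the slide; Python's tuple ordering (the shorter one is smaller) is used here as an assumption.

```python
def path_frequency(path, edge_freq):
    """Non-decreasingly sorted tuple of the edge frequencies along path."""
    return tuple(sorted(edge_freq[(u, v)] for u, v in zip(path, path[1:])))


def more_frequent(f, g):
    """True if f > g, decided by the first position where they differ."""
    return f > g  # Python compares tuples element by element


# Edge frequencies from the example (only those listed on the slide).
edge_freq = {
    ("v1", "v2"): 14,
    ("v2", "v3"): 10,
    ("v3", "v12"): 10,
    ("v2", "v12"): 8,
}

pf_short = path_frequency(["v1", "v2", "v12"], edge_freq)
pf_long = path_frequency(["v1", "v2", "v3", "v12"], edge_freq)
pf_other = (1, 21, 21)  # given directly on the slide for v1 → v10 → v11 → v12

assert more_frequent(pf_long, pf_short)
assert more_frequent(pf_short, pf_other)
```

Sorting puts the least frequent edge first, so a single bottleneck edge decides the comparison before the busier edges are even looked at.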
Time period-based most frequent path (TPMFP)
• Compare path frequency
– More-frequent-than relation (>)
– F > F’ if their first different value fj > fj’
– (10,10,14) > (8,14) > (1,21,21)
• Property
– A total order, so an MFP always exists
– Guarantees suffix optimality
– The length of a path doesn’t matter much
– Paths with infrequent edges are disadvantaged
Summary
• Background
– What is trajectory
– Where do they come from
– Why are they useful
– Characteristics
• Trajectory query processing
– Query classification
– Trajectory distance measures
– Trajectory index
• Application
– Popular route mining
– Co-traveller discovery
– Trajectory clustering
Thank you
• Questions
– kevinz@itee.uq.edu.au
• DKE group at UQ
– http://www.itee.uq.edu.au/dke/dke-lab
• My research
– http://staff.itee.uq.edu.au/kevinz/