Spatio-Temporal Aggregation Using Sketches

Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias
Department of Computer Science
City University of Hong Kong, Boston University, Hong Kong University of Science and Technology
March 18, 2004

Outline
• Applications and motivation
• Preliminaries
  – Aggregate trees and sketch techniques
• Distinct spatio-temporal aggregation
• Performance study
• Extensions
• Conclusion

Spatio-Temporal Aggregate Query -- Applications
• Traffic supervision systems
  – Monitoring the number of vehicles in a district; the information can be used, for example, to identify traffic-jam areas.
• Mobile computing applications
  – Allocating bandwidth according to the usage of each region.
Example: A wireless company would like to know the number of cell phone users in a particular region during a specified period. It is also interesting to know the total number of phone calls made by all users who qualify for the first query.

Spatio-Temporal Aggregate Query
• Spatio-temporal applications require the retrieval of summarized information about moving objects.
• Given a query region qr (a rectangle) and a query interval qt, a spatio-temporal aggregate query retrieves information about the objects that appeared in qr during qt.
  – Spatio-temporal count: returns the total number of qualifying objects.
  – Spatio-temporal sum: each object is associated with a measure; the query outputs the sum of the measures of the qualifying objects.
• Existing approach: multi-tree structures based on R-trees and B-trees.
  – Problem: if an object remains in the query region for several timestamps during the query interval, it is counted (or summed) multiple times in the result.

Spatio-Temporal Aggregate Query (cont.)
How can a “distinct aggregate query” be answered? E.g., how many cars are present in a district?
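The double-counting problem described above can be illustrated with a small Python sketch. The observation list, the object names, and the helper functions `naive_count` and `distinct_count` are hypothetical and only serve the illustration; they are not part of the presented system. The naive version sums per-timestamp counts, which is all a structure storing only summarized data can do, while the distinct version needs the individual object ids.

```python
from collections import defaultdict

# Hypothetical observations: (timestamp, object_id) for objects seen inside qr.
observations = [
    (1, "car_a"), (1, "car_b"),
    (2, "car_a"), (2, "car_b"), (2, "car_c"),
    (3, "car_a"),
]

def per_timestamp_counts(obs):
    """Summarized data: how many objects were in qr at each timestamp."""
    counts = defaultdict(int)
    for t, _ in obs:
        counts[t] += 1
    return dict(counts)

def naive_count(obs, qt):
    """Sum the per-timestamp counts over qt; objects that stay are recounted."""
    lo, hi = qt
    return sum(c for t, c in per_timestamp_counts(obs).items() if lo <= t <= hi)

def distinct_count(obs, qt):
    """Count each object once, which requires keeping the object ids."""
    lo, hi = qt
    return len({oid for t, oid in obs if lo <= t <= hi})

print(naive_count(observations, (1, 3)), distinct_count(observations, (1, 3)))
```

In this toy input, `car_a` stays in the region for three timestamps and therefore contributes three times to the naive result but only once to the distinct result.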
Motivation: Distinct Spatio-Temporal Aggregate Query
• Distinct aggregate queries enable a much richer range of decision-making queries.
• But: there is no way to summarize distinct objects exactly that is substantially better than simply enumerating all of them.
• Solution:
  – Spatio-temporal aggregation index trees
  – Sketch techniques

Example regions
[Figure: regions r1 to r4, grouped under R1 and R2, each annotated with its aggregate value at timestamps 1 to 5, together with the query rectangle qr.]
Query qr retrieves the aggregate sum (during time T1-T3) of all rectangles that intersect it.

Preliminaries -- Aggregate RB-tree
• In the aRB-tree, the extents of all regions (in this case r1, r2, …, r4) are stored in an R-tree.
• Each (leaf or non-leaf) entry of the R-tree is associated with a pointer to a B-tree that stores historical aggregate data about the entry.
[Figure: an R-tree for the spatial dimensions over R1, R2 and r1 to r4, with one B-tree of historical aggregates per R-tree entry.]

Preliminaries – Flajolet-Martin sketches
• Goal: a small-space representation of a set of items.
• Prerequisite: let h be a random, binary hash function.
Sketch of an item
• For each unique item with ID x:
  – For each integer 1 ≤ i ≤ k in turn, compute h(x, i).
  – Stop when h(x, i) = 1, and set bit i.
• The sketch of a union of items is the OR of their bitmaps.
[Figure: bitmaps of two items X and Z and the bitmap of their combination.]

Preliminaries – Flajolet-Martin sketches (cont.)
Estimating COUNT
• Take the bitmap of a set of N items.
• Let j be the position of the leftmost zero in the bitmap.
• j is an estimator of log2(0.77 N).
[Figure: bitmap of a set S whose leftmost zero is at j = 3.]
Best guess: COUNT ~ 11
Fixable drawback:
• The variance of the estimate is large.

Preliminaries – Flajolet-Martin sketches (cont.)
Standard variance reduction methods apply.
• Compute m independent bitmaps in parallel.
• Generate m independent estimates of N.
• Take the mean of the estimates.
There are provable tradeoffs between m and the variance of the estimator.

Distinct Spatio-Temporal Aggregation
Exact solution
• If n is the number of distinct objects and T is the total number of timestamps in the history, the exact solution requires Ω(n∙T) space.
Existing aggregation approach
• The aRB-tree stores only summarized data; information about individual objects is lost, so the problem cannot be solved.
Our solution
• Combine the aRB-tree with the FM sketch technique: for each region ri and every timestamp t, we maintain a sketch si(t) that captures the (ids of the) objects in ri at t.
• Requires O(m∙R∙T∙log n) space, where R is the number of regions and m is an adjustable constant specifying the number of bitmaps used by one sketch (m determines the tradeoff between space overhead and approximation accuracy).

System Architecture
[Figure: regions r1 to r4 send object ids or weights to sketch producers; the resulting sketches are stored in the database, which answers aggregate queries with approximate results.]
• The sketches can be stored in a two-dimensional array indexed by region and time:

  regions
    r4    10000   11000   10000   10100   10100
    r3    01000   10000   10000   10000   11111
    r2    10100   10000   10000   11000   10001
    r1    10000   01100   01100   11100   10100
          1       2       3       4       5       time

Sketch Indexing Structures
[Figure: an R-tree for the spatial dimensions over R1, R2 and r1 to r4, with one B-tree of <time, sketch> entries per R-tree entry; the example query has region qr and interval qt = (1,4).]
• The sketch of a non-leaf entry in a B-tree equals the OR of all the sketches in its sub-tree.

Query Processing
• Similar to the query processing technique of the aRB-tree.
Basic idea: the spatial and temporal search conditions are applied alternately.
The result sketch is incrementally updated.
• Can be improved by applying pruning techniques.
  – Heuristic 1: Let RS be the current result sketch, and e a non-leaf B-tree entry whose associated sketch is se. The sub-tree of e can be pruned if (se OR RS) = RS.
  – Heuristic 2: Given a set of entries that cannot be pruned by Heuristic 1, visit their child nodes in descending order of the number of 1’s in their sketches.
  – And more heuristics!

Query Processing – Supporting Distinct Sum Query
Extending FM sketches
• FM sketches can handle this: to insert a value of 500, perform 500 distinct item insertions.
• Our observation: a large number of insertions into an FM sketch can be simulated more efficiently.

Performance
• Dataset settings
  – Number of cities = 10,000
  – Number of buses = 100,000
  – History length = 1,00 timestamps
  – Number of passengers for each bus = [200,300]
  – At each timestamp, each bus reports to its nearest city: <time t, city c, bus b, passenger # a>
• Each query has 2 parameters: spatial extent and interval length.
• A count query retrieves the number of distinct buses that report to cities in qr during qt, while a sum query returns the sum of these buses’ passengers.
• The sketch index is compared with the relational approach: index the 4-tuple table <t,c,b,a> using a B-tree on the time column t.

Results (Space Consumption)
[Figure: size (megabytes) of the database and of the sketch index with 8, 16 and 32 bitmaps per sketch.]
The size of the sketch index could be further reduced by applying simple compression techniques!

Results (Sketch Pruning in Query)
[Figure (a): number of disk accesses vs. query rectangle length qrlen (qtlen=10), for sketch-pruning, naive and relational.]
[Figure (b): number of disk accesses vs. query interval length qtlen (qrlen=0.15), for sketch-pruning, naive and relational.]

Results (Accuracy of Approximate Results)
[Figure (a): relative error vs. query rectangle length qrlen (qtlen=10, count), for 8-, 16- and 32-bitmap sketches.]
[Figure (b): relative error vs. query rectangle length qrlen (qtlen=10, sum), for 8-, 16- and 32-bitmap sketches.]

Results (Costs of Indexes)
[Figure (a): number of disk accesses vs. query rectangle length qrlen (qtlen=10), for 8-, 16- and 32-bitmap sketches.]
[Figure (b): number of disk accesses vs. query interval length qtlen (qrlen=0.15), for 8-, 16- and 32-bitmap sketches.]

Extensions
• Approximating general moving data
  – Problem: each object o reports its location <x,y> at each timestamp t, so the size of the database grows continuously: O(n∙T).
  – Solution: impose a res×res regular grid over the data space and apply the sketch index, treating the grid cells as the finest aggregate granularity: O(res²∙T∙log n) [or O(T∙log n) when res is a constant].
[Figure: B-trees at levels 0, 1, …, L.]

Conclusion
• We propose a sketch index that integrates traditional approximate counting techniques with spatio-temporal indexes for efficient distinct aggregate query processing in spatio-temporal databases.
• The sketch index consumes less space than a conventional database and answers queries an order of magnitude faster, with low aggregate error.
• Extensions and future work
  – Other possible sketches
  – More sophisticated algorithms for mining association rules
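
As a companion to the Flajolet-Martin preliminaries and pruning Heuristic 1, the following is a minimal Python sketch of the ideas, not the authors' implementation. The class name `FMSketch`, the helper `first_one`, the use of SHA-256 as the hash function, and the defaults m = 64 and k = 32 are all assumptions made for illustration. One deliberate difference from the slides: the code averages the positions j over the m bitmaps before exponentiating (the usual Flajolet-Martin combination) rather than averaging m separate estimates.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant (rounded to 0.77 in the slides)

def first_one(x, seed, k):
    """0-based index of the first trial i with h(x, i) = 1.

    Assumption: bit i of a seeded SHA-256 digest plays the role of h(x, i).
    """
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    for i in range(k):
        if (value >> i) & 1:
            return i
    return k - 1  # all k trials gave 0: clamp to the last bit

class FMSketch:
    def __init__(self, m=64, k=32):
        self.m, self.k = m, k
        self.bitmaps = [0] * m  # each bitmap is an int used as a bit vector

    def insert(self, x):
        # Duplicates set the same bits again, so they leave the sketch unchanged.
        for seed in range(self.m):
            self.bitmaps[seed] |= 1 << first_one(x, seed, self.k)

    def union(self, other):
        # Sketch of a union of sets = bitwise OR of the bitmaps.
        out = FMSketch(self.m, self.k)
        out.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return out

    def estimate(self):
        # j = position of the lowest zero bit; average j, then exponentiate.
        total = 0
        for b in self.bitmaps:
            j = 0
            while (b >> j) & 1:
                j += 1
            total += j
        return 2 ** (total / self.m) / PHI

    def covered_by(self, rs):
        """Heuristic 1 test: True if (self OR rs) = rs, so self adds nothing."""
        return all((a | b) == b for a, b in zip(self.bitmaps, rs.bitmaps))

# Two overlapping sets of hypothetical bus ids, 1500 distinct ids in total.
a, b = FMSketch(), FMSketch()
for i in range(1000):
    a.insert(f"bus{i}")
for i in range(500, 1500):
    b.insert(f"bus{i}")
u = a.union(b)
print(round(u.estimate()), a.covered_by(u))
```

Because a sketch already contained in the result sketch cannot change the estimate, `covered_by` returning True is the condition under which a sub-tree could be skipped under Heuristic 1.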