Spatio-Temporal Aggregation Using Sketches

Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias
Department of Computer Science
City University of Hong Kong, Boston University,
Hong Kong University of Science and Technology
March 18, 2004
Outline
• Applications and motivation
• Preliminaries – Aggregate trees and sketch techniques
• Distinct spatio-temporal aggregation
• Performance study
• Extensions
• Conclusion
Spatio-Temporal Aggregate Query -- Applications
• Traffic Supervision Systems
– Monitoring the number of vehicles in a district; the information can be
used, for example, to identify traffic-jam areas.
• Mobile Computing Applications
– Allocating bandwidth according to the usage of each region.
Example: A wireless company would like to know the number of cell phone
users in a particular region during a specified period. It may also want to
know the total number of phone calls made by all users who qualify for the
first query.
Spatio-Temporal Aggregate Query
• Spatio-temporal applications require the retrieval of summarized
information about moving objects.
• Given a query region (a rectangle qr) and a query interval qt, a
spatio-temporal aggregate query retrieves information about the
objects that appeared in qr during qt.
– Spatio-Temporal Count
• Returns the total number of qualifying objects.
– Spatio-Temporal Sum
• Each object is associated with a measure; the query outputs the sum
of the measures of the qualifying objects.
Existing approach: multi-tree structures based on R-trees and B-trees
– Problem: If an object remains in the query region for several
timestamps during the query interval, it will be counted (or
summed) multiple times in the result.
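The double-counting problem can be illustrated with a minimal sketch. The observation tuples and the helper names `per_timestamp_count` and `distinct_count` are hypothetical and only serve the illustration; they are not taken from the slides.

```python
# Hypothetical observations: (timestamp, region, object_id).
observations = [
    (1, "r1", "car_a"),
    (2, "r1", "car_a"),  # car_a stays in r1 for a second timestamp
    (2, "r1", "car_b"),
    (3, "r2", "car_c"),
]

def per_timestamp_count(obs, regions, t_start, t_end):
    """Add up one count per (object, timestamp) pair, as summing
    pre-aggregated per-timestamp counts would do."""
    return sum(1 for t, r, _ in obs if r in regions and t_start <= t <= t_end)

def distinct_count(obs, regions, t_start, t_end):
    """Count each qualifying object once, however long it stays."""
    return len({o for t, r, o in obs if r in regions and t_start <= t <= t_end})

print(per_timestamp_count(observations, {"r1"}, 1, 3))  # car_a is counted twice
print(distinct_count(observations, {"r1"}, 1, 3))       # car_a is counted once
```

The exact distinct count needs the individual object ids, which is what makes it expensive to maintain over a long history.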
Spatio-Temporal Aggregate Query (cont.)
How can a “distinct aggregate query” be answered?
e.g., how many cars are present in a district?
[Figure: map of a district around a stadium]
Motivation: Distinct Spatio-Temporal Aggregate Query
• Distinct aggregates enable a much richer range of decision-making queries.
• But: there is no way to summarize distinct objects exactly that is
substantially better than simply enumerating all of them.
• Solution: spatio-temporal aggregation index trees + sketch techniques
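Example
[Figure: regions r1, r2, r3, r4, their bounding rectangles R1 and R2, and the
query rectangle qr]
Query qr retrieves the aggregate sum (during time T1-T3) of all rectangles
that intersect it.
Aggregate value of each region at each timestamp:

region   t=1   t=2   t=3   t=4   t=5
r1       150   150   145   135   130
r2        75    80    85    90    90
r3        12    12    12    12    12
r4       132   127   125   127   127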
Preliminaries -- Aggregate RB-tree
In the aRB-tree, the extents of all regions (in this case r1, r2, r3, r4) are stored
in an R-tree. Each entry of the R-tree (leaf or non-leaf) is associated with a
pointer to a B-tree that stores historical aggregate data about the entry.
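As a rough illustration of the historical data kept per entry, the sketch below stores the history of region r1 as a sorted list of <time, aggregate> pairs and sums it over an interval. It assumes that a pair is recorded only when the aggregate changes, which is what the example values suggest; a plain list with binary search stands in for the B-tree, and the names `value_at` and `sum_over_interval` are hypothetical.

```python
from bisect import bisect_right

# History of region r1 from the example: <time, aggregate> pairs.
# Assumption: a pair is stored only when the aggregate changes.
r1_history = [(1, 150), (3, 145), (4, 135), (5, 130)]

def value_at(history, t):
    """Aggregate in effect at timestamp t (latest pair at or before t)."""
    times = [ts for ts, _ in history]
    idx = bisect_right(times, t) - 1
    if idx < 0:
        raise ValueError("no aggregate recorded at or before t")
    return history[idx][1]

def sum_over_interval(history, t_start, t_end):
    """Sum of the per-timestamp aggregates over [t_start, t_end]."""
    return sum(value_at(history, t) for t in range(t_start, t_end + 1))

print(sum_over_interval(r1_history, 1, 3))  # 150 + 150 + 145
```

A real B-tree would keep such partial sums in its non-leaf entries, so an interval that covers a whole sub-tree can be answered without visiting the leaves.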
[Figure: aRB-tree for the example. The R-tree for the spatial dimensions has
entries R1 (covering r1, r2) and R2 (covering r3, r4); each R-tree entry points
to its own B-tree of <time, aggregate> pairs. For instance, the B-tree for r1
has leaf entries <1, 150>, <3, 145>, <4, 135>, <5, 130> and root entries
<1, 445>, <4, 265>.]
Preliminaries – Flajolet-Martin sketches
• Goal: Small-space representation of a set of items.
• Prerequisite: Let h be a random, binary hash function.
• Sketch of an item: for each unique item with ID x,
– for each integer 1 ≤ i ≤ k in turn, compute h(x, i);
– stop when h(x, i) = 1, and set bit i.
• Sketch of a union of items is the OR of their bitmaps.
[Figure: example bitmaps for sets X and Z]
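The insertion rule above can be sketched as follows. The bitmap length `K = 32` and the use of SHA-256 as a stand-in for the random binary hash h are assumptions made for this illustration, not choices stated on the slides.

```python
import hashlib

K = 32  # bitmap length k (assumed value)

def h(x, i, seed=0):
    """Pseudo-random binary hash of (x, i); stands in for the slide's h."""
    digest = hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()
    return digest[0] & 1

def sketch_of_item(x, k=K, seed=0):
    """Bitmap (list of k bits) with at most one bit set for item x."""
    bits = [0] * k
    for i in range(1, k + 1):
        if h(x, i, seed) == 1:
            bits[i - 1] = 1  # set bit i and stop
            break
    return bits

def union(a, b):
    """Sketch of a union: bitwise OR of the two bitmaps."""
    return [p | q for p, q in zip(a, b)]

def sketch_of_set(items, k=K, seed=0):
    """OR together the sketches of all items; duplicates have no effect."""
    bits = [0] * k
    for x in items:
        bits = union(bits, sketch_of_item(x, k, seed))
    return bits
```

Because each item always maps to the same bit, inserting it again leaves the bitmap unchanged, which is the property that makes the sketch suitable for distinct counting.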
Preliminaries – Flajolet-Martin sketches (cont.)
Estimating COUNT
• Take the bitmap of a set of N items, e.g. S = 1 1 1 0 1.
• Let j be the position of the leftmost zero in the bitmap
(here j = 3, counting positions from 0).
• j is an estimator of log2 (0.77 N).
• Best guess: COUNT ~ 11
Fixable drawbacks:
• Variance in the estimate is large.
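The estimation step can be written in a few lines. This is a minimal sketch that inverts the relation j ≈ log2(0.77 N); the function names are hypothetical, and positions are counted from 0 so that j equals the number of leading 1s.

```python
PHI = 0.77  # correction constant quoted on the slide

def leftmost_zero(bits):
    """0-based position of the leftmost zero (number of leading 1s)."""
    for j, b in enumerate(bits):
        if b == 0:
            return j
    return len(bits)

def estimate_count(bits):
    """Invert j ~ log2(PHI * N) to obtain an estimate of N."""
    return 2 ** leftmost_zero(bits) / PHI

bits = [1, 1, 1, 0, 1]
print(leftmost_zero(bits))   # j = 3
print(estimate_count(bits))  # 2**3 / 0.77, roughly 10.4
```

Since the estimate can only take values of the form 2^j / 0.77, a single bitmap is coarse, which is the variance problem noted above.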
Preliminaries – Flajolet-Martin sketches (cont.)
Standard variance reduction methods apply.
• Compute m independent bitmaps in parallel.
• Generate m independent estimates of N.
• Take the mean of the estimates.
Provable tradeoffs between m and variance of the estimator.
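A minimal sketch of this averaging follows, assuming that the m independent hash functions are obtained by seeding one hash differently per bitmap. The names, the bitmap length and the SHA-256 stand-in are assumptions for illustration. The code takes the mean of the m estimates, as the slide states; the original Flajolet-Martin scheme averages the positions j before exponentiating instead.

```python
import hashlib
from statistics import mean

PHI = 0.77  # correction constant from the slides
K = 32      # bitmap length (assumed value)

def h(x, i, seed):
    """Pseudo-random bit for (x, i); a different seed gives a different hash."""
    return hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()[0] & 1

def bitmap(items, seed, k=K):
    """FM bitmap of a collection of items under the hash selected by seed."""
    bits = [0] * k
    for x in items:
        for i in range(k):
            if h(x, i, seed):
                bits[i] = 1
                break
    return bits

def estimate(bits):
    """Single-bitmap estimate 2**j / PHI, with j the leftmost zero position."""
    j = next((p for p, b in enumerate(bits) if b == 0), len(bits))
    return 2 ** j / PHI

def averaged_estimate(items, m):
    """Mean of m independent single-bitmap estimates."""
    items = list(items)
    return mean(estimate(bitmap(items, seed)) for seed in range(m))

print(averaged_estimate(range(1000), m=16))
```

Larger m costs proportionally more space and update time per sketch, which is the tradeoff the slide refers to.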
Distinct Spatio-Temporal Aggregation
Exact Solution
If n is the number of distinct objects and T is the total number of
timestamps in history, the exact solution requires Ω(n∙T) space.
Existing Aggregation Approach
The aRB-tree stores only summarized data; information about individual
objects is lost, so the problem cannot be solved.
Our Solution
• Combine the aRB-tree with the FM sketch technique!
For each region ri and every timestamp t, we maintain a sketch si(t) that
captures the (ids of the) objects in ri at t.
• Requires O(m∙R∙T∙log n) space,
where R is the number of regions and m is an adjustable constant specifying
the number of bitmaps used by one sketch (it determines the tradeoff between
overhead and approximation accuracy).
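The idea of keeping one sketch per region and timestamp can be sketched as below. For brevity the example uses a single bitmap per sketch (m = 1), stores each bitmap as a Python integer whose least significant bit plays the role of the leftmost position, and keeps the sketches in a dictionary rather than in an index; the class and function names are hypothetical.

```python
import hashlib

K = 32      # bitmap length (assumed value)
PHI = 0.77  # correction constant from the slides

def item_bit(x, seed=0, k=K):
    """Integer with the single FM bit of item x set (bit 0 = leftmost)."""
    for i in range(k):
        if hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()[0] & 1:
            return 1 << i
    return 0

class SketchArray:
    """One sketch s_i(t) per (region, timestamp) pair."""

    def __init__(self):
        self.cells = {}  # (region, timestamp) -> bitmap as int

    def report(self, region, t, object_id):
        key = (region, t)
        self.cells[key] = self.cells.get(key, 0) | item_bit(object_id)

    def query_sketch(self, regions, t_start, t_end):
        """OR of the sketches of all qualifying regions and timestamps."""
        result = 0
        for r in regions:
            for t in range(t_start, t_end + 1):
                result |= self.cells.get((r, t), 0)
        return result

def estimate(bitmap):
    """Distinct-count estimate 2**j / PHI from the lowest zero bit j."""
    j = 0
    while bitmap & (1 << j):
        j += 1
    return 2 ** j / PHI

array = SketchArray()
array.report("r1", 1, "car_a")
array.report("r1", 2, "car_a")  # same object, later timestamp
array.report("r2", 2, "car_b")
print(estimate(array.query_sketch(["r1", "r2"], 1, 2)))
```

An object that stays in a region for several timestamps sets the same bit every time, so OR-ing the sketches does not count it again.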
System Architecture
[Figure: system architecture. Object ids or weights from regions r1 to r4 pass
through sketch producers; the resulting sketches are stored in the database,
which returns approximate results for aggregate queries.]
The sketches can be stored in a two-dimensional array (regions × time):

region   t=1     t=2     t=3     t=4     t=5
r1       10000   01100   01100   11100   10100
r2       10100   10000   10000   11000   10001
r3       01000   10000   10000   10000   11111
r4       10000   11000   10000   10100   10100
Sketch Indexing Structures
[Figure: sketch index for regions r1 to r4 and a query with region qr and
interval qt = (1, 4). An R-tree indexes the spatial dimensions (entries R1 and
R2 over r1, r2, r3, r4); each R-tree entry points to a B-tree whose entries
are <time, sketch> pairs.]
The sketch of a non-leaf entry in a B-tree equals the OR of all the sketches
in its sub-tree.
Query Processing
• Similar to the query processing technique of the aRB-tree.
Basic idea: the spatial and temporal search conditions are applied
alternately, and the result sketch is updated incrementally.
• Can be improved by applying pruning techniques.
Heuristic 1: Let RS be the current result sketch, and e a non-leaf B-tree
entry whose associated sketch is se. Then the sub-tree of e can be pruned
if (se OR RS) = RS.
Heuristic 2: Given a set of entries that cannot be pruned by Heuristic 1,
we visit their child nodes in descending order of the number of 1’s in their
sketches.
And more heuristics!
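The two heuristics can be sketched on bitmaps stored as integers. This is a simplified illustration, not the index traversal itself: it assumes every candidate entry lies entirely inside the query, so its sketch can be OR-ed into the result directly, and the function names are hypothetical.

```python
def can_prune(entry_sketch, result_sketch):
    """Heuristic 1: the entry adds no new bit to the result sketch."""
    return (entry_sketch | result_sketch) == result_sketch

def visit_order(entries):
    """Heuristic 2: order (name, sketch) pairs by descending number of 1s."""
    return sorted(entries, key=lambda e: bin(e[1]).count("1"), reverse=True)

def merge_entries(entries, result_sketch=0):
    """OR the useful entries into the result; report which were visited."""
    visited = []
    for name, sketch in visit_order(entries):
        if can_prune(sketch, result_sketch):
            continue  # nothing new below this entry
        visited.append(name)
        result_sketch |= sketch
    return result_sketch, visited

entries = [("a", 0b10100), ("b", 0b11100), ("c", 0b00011)]
result, visited = merge_entries(entries)
print(bin(result), visited)
```

Visiting the densest sketches first fills the result sketch quickly, which tends to make later entries prunable under Heuristic 1.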
Query Processing – Supporting Distinct Sum Query
Extending FM sketches
• FM sketches can already handle this: to insert a value of 500, perform
500 distinct item insertions.
• Our observation: we can simulate a large number of insertions into an FM
sketch more efficiently.
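The naive approach from the first bullet can be sketched as follows: a weight w for an object is inserted as w distinct sub-items (object id, 0), ..., (object id, w - 1). The more efficient simulation mentioned in the second bullet is not reproduced here; the function names, bitmap length and SHA-256 stand-in are assumptions for illustration.

```python
import hashlib

K = 32  # bitmap length (assumed value)

def item_bit(x, seed=0, k=K):
    """Integer with the single FM bit of item x set (bit 0 = leftmost)."""
    for i in range(k):
        if hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()[0] & 1:
            return 1 << i
    return 0

def insert_weighted(bitmap, object_id, weight):
    """Naive distinct-sum insertion: one sub-item per unit of weight."""
    for c in range(weight):
        bitmap |= item_bit((object_id, c))
    return bitmap

# A bus with 250 passengers reporting at two timestamps is summed once.
once = insert_weighted(0, "bus_1", 250)
twice = insert_weighted(once, "bus_1", 250)
print(once == twice)
```

The cost of this version grows linearly with the inserted weight, which is what motivates simulating the insertions instead.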
Performance
• Dataset settings
– Number of cities = 10,000
– Number of buses = 100,000
– History length = 1,00 timestamps
– Number of passengers for each bus = [200,300]
– At each timestamp, each bus reports to its nearest city a tuple
<time t, city c, bus b, passenger # a>
• Each query has 2 parameters: spatial extent and interval length.
• A count query retrieves the number of distinct buses that report to cities
in qr during qt, while a sum query returns the sum of these buses’ passengers.
• We compare the sketch index to the relational approach: index the 4-tuple
table <t,c,b,a> using a B-tree on the time column t.
Results (Space Consumption)
[Figure: size (megabytes) of the database and of the sketch index with 8, 16
and 32 bitmaps per sketch]
The size of the sketch index could be further reduced by applying simple
compression techniques!
Results (Sketch Pruning in Query)
[Figure: number of disk accesses for the sketch-pruning, naive and relational
methods.
(a) Cost vs. qrlen (query rectangle length 0.05 to 0.25, qtlen=10)
(b) Cost vs. qtlen (query interval length 1 to 20, qrlen=0.15)]
Results (Accuracy of Approximate Results)
[Figure: relative error of the 8-bitmap, 16-bitmap and 32-bitmap sketches.
(a) Error vs. qrlen (query rectangle length 0.05 to 0.25, qtlen=10, count)
(b) Error vs. qrlen (query rectangle length 0.05 to 0.25, qtlen=10, sum)]
Results (Costs of Indexes)
[Figure: number of disk accesses of the 8-bitmap, 16-bitmap and 32-bitmap
indexes.
(a) Cost vs. qrlen (query rectangle length 0.05 to 0.25, qtlen=10)
(b) Cost vs. qtlen (query interval length 1 to 20, qrlen=0.15)]
Extensions
• Approximating general moving data
Problem: Each object o reports its location <x,y> at each timestamp t, so the
size of the database grows continuously: O(n∙T).
Solution: Impose a regular res × res grid over the data space; the sketch
index is applied by treating the grid cells as the finest aggregate
granularity: O(res²∙T∙log n) [or O(T∙log n) when res is a constant].
[Figure: hierarchy of levels (Level 0, Level 1, ..., Level L), with B-trees
attached to the entries of each level]
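Mapping a reported location to a grid cell can be sketched as below. The sketch assumes a square data space normalised to [0, extent] in both dimensions, which the slides do not state, and the function name `cell_of` is hypothetical. Each cell then plays the role of a region in the sketch index.

```python
def cell_of(x, y, res, extent=1.0):
    """Map a location in [0, extent] x [0, extent] to a (col, row) cell
    of a regular res x res grid; points on the upper border go to the
    last cell."""
    col = min(int(x / extent * res), res - 1)
    row = min(int(y / extent * res), res - 1)
    return col, row

print(cell_of(0.15, 0.25, res=10))  # cell containing the point
print(res_cells := 10 * 10)         # number of finest-granularity cells
```

With res fixed, the number of cells is constant, which is why the space bound no longer depends on the number of objects beyond the log n sketch size.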
Conclusion
• We propose a sketch index that integrates traditional approximate
counting techniques with spatio-temporal indexes for efficient distinct
aggregate query processing in spatio-temporal databases.
• The sketch index consumes less space and answers queries an order of
magnitude faster, with small aggregate error, compared with a conventional
database.
• Extensions and future work
– Other possible sketches
– More sophisticated algorithms for mining association rules