MD-HBase: A Scalable Multi-dimensional Data
Infrastructure for Location Aware Services
S. Nishimura (NEC Service Platforms Labs.),
S. Das, D. Agrawal, A. Abbadi
(University of California, Santa Barbara)
Presenter: Zhuo Liu
Overview
▐ A Motivating Story
▐ Existing Technologies
▐ Our proposal
▐ Evaluation
▐ Conclusion
Motivating Scenario: Mobile Coupon Distribution
[Figure: a Mobile Coupon Distributor, the Current Location of each mobile user, and a Coupon.]
Distribution Policy
• Area
• # of coupons
Motivating Scenario: Mobile Coupon Distribution
▐ System Scalability
● Large amounts of Data
● High Throughput
▐ Efficient Complex Queries
● Multi-Dimensional Query
● Nearest Neighbors Query
[Figure: many mobile users, each with a Current Location (125,000,000 subscribers in Japan), and Coupons sent under the Distribution Policy (Area, # of coupons).]
Existing Technologies
[Figure: technologies placed on two axes, Multidimensional Queries and Scalability. Labels: Commercial products (but expensive); Relational DBs; Spatial DBs; Open source products; Key-Value Stores; What We Want (at a reasonable price).]
Ordered Key-Value Stores
ex. BigTable, HBase
[Figure: an Index (key00, key11, …, keynn) pointing to Buckets of key-value pairs sorted by key (key00 value00, key01 value01, …, keynn valuenn).]
Good at 1-D Range Query
But, our target is multi-dimensional… (Latitude, Longitude, Time)
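Why sorted keys make 1-D range queries cheap can be sketched in a few lines. This is a toy in-memory illustration, not BigTable or HBase code; the class `SortedStore` and its methods `put` and `scan` are hypothetical names introduced here.

```python
import bisect

class SortedStore:
    """Toy ordered key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key <= stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

store = SortedStore()
for i in (12, 0, 11, 1, 5):
    store.put(f"key{i:02d}", f"value{i:02d}")

# Two binary searches locate the range; only keys inside it are touched.
result = store.scan("key01", "key11")
```

The range is found with two binary searches over one sorted dimension, which is exactly what is missing when a query constrains several dimensions at once.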
Naïve Solution: Linearization
Projects n-D space to 1-D space
Apply a Z-ordering curve…
 5  7 13 15
 4  6 12 14
 1  3  9 11
 0  2  8 10
[Figure: the linearized values are stored as keys in the ordered key-value store (key00 value00, …, keynn valuenn).]
Simple, but problematic…
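The Z-ordering of the 4×4 grid can be reproduced with a small bit-interleaving routine. This is a minimal sketch; the function name `z_encode` and the 2-bits-per-dimension setting are assumptions chosen to match the grid on the slide.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x bit first, most significant first)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# Build the grid with y increasing upward, as drawn on the slide.
grid = [[z_encode(x, y) for x in range(4)] for y in reversed(range(4))]
for row in grid:
    print(" ".join(f"{z:2d}" for z in row))
```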
Problem: False positive scans
▐ MD-query on Linearized space
● Translate an MD-query to a linearized range query.
• Ex. Query from 2 to 9.
● Scan the queried linearized range.
● Filter out points outside the queried area.
• ex. blue-hatched area (4 to 7)
 5  7 13 15
 4  6 12 14
 1  3  9 11
 0  2  8 10
Requires the boundary information of the original space.
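The false positives in the example can be counted directly. The sketch below assumes the queried area is the rectangle x in [1, 2], y in [0, 1], which is the rectangle whose corners linearize to 2 and 9; the helper names `z_encode` and `z_decode` are introduced here for illustration.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x bit first, most significant first)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def z_decode(z, bits=2):
    """Inverse of z_encode: recover (x, y) from a linearized value."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y

# Assumed queried area: x in [1, 2], y in [0, 1].
x_lo, x_hi, y_lo, y_hi = 1, 2, 0, 1
z_lo, z_hi = z_encode(x_lo, y_lo), z_encode(x_hi, y_hi)

def inside(z):
    x, y = z_decode(z)
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

scanned = list(range(z_lo, z_hi + 1))      # the linearized range scan
hits = [z for z in scanned if inside(z)]   # points kept by the filter
false_positives = [z for z in scanned if not inside(z)]
```

Half of the scanned cells are discarded here, and the filter itself needs the original (x, y) boundaries, not just the linearized range.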
Our Approach: MD-HBase
▐ Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store
[Figure: MD-HBase (Multi-Dimensional Index) layered on an Ordered Key-Value Store (Single Dimensional Index), ex. BigTable, HBase, …]
Introduce Multi-dimensional Index
▐ Multi-dimensional Index (ex. the K-d tree, the Quad tree)
● Divide a space into subspaces containing almost the same # of points
● Organize subspaces as a tree
Efficient subspace pruning → to avoid false positive scans
[Figure: a space divided into subspaces, which are organized as a tree.]
Space Partition By the K-d tree
Binary Z-ordering space (bitwise interleaving, ex. x=00, y=11 → 0101)
y\x  00   01   10   11
11   0101 0111 1101 1111
10   0100 0110 1100 1110
01   0001 0011 1001 1011
00   0000 0010 1000 1010
[Figure: the same space partitioned by the K-d tree.]
How do we represent these subspaces?
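One way to see how a K-d style partition lines up with the binary Z-order keys: because x and y bits alternate in the key, fixing one more leading bit halves a region alternately along x and along y. The sketch below assumes midpoint splits and a fixed bucket capacity; `split_into_buckets`, `capacity`, and the sample keys are all hypothetical, chosen only to illustrate the idea.

```python
def split_into_buckets(keys, capacity, bits=4, prefix=""):
    """Recursively halve the space until each bucket holds <= capacity keys.

    Each split fixes one more leading bit of the interleaved key, so it cuts
    the current region in half, alternating between the x and y dimensions.
    Returns a list of (subspace_name, keys) pairs in key order.
    """
    if len(keys) <= capacity or len(prefix) == bits:
        return [(prefix.ljust(bits, "*"), keys)]
    pos = len(prefix)
    left = [k for k in keys if k[pos] == "0"]
    right = [k for k in keys if k[pos] == "1"]
    return (split_into_buckets(left, capacity, bits, prefix + "0")
            + split_into_buckets(right, capacity, bits, prefix + "1"))

# Sample keys (made up): dense in the lower-left, sparse on the right.
keys = ["0000", "0001", "0010", "0011", "0100", "0111", "1001", "1110"]
buckets = split_into_buckets(keys, capacity=2)
names = [name for name, _ in buckets]
```

With these sample keys the dense region is split three times and the sparse right half is left whole.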
Key Idea: The longest common prefix naming scheme
Subspaces represented as the longest common prefix of keys!
[Figure: the partitioned Z-ordering space with subspaces labeled by prefix, ex. 000* and 1***.]
Remarkable Property
• Preserve boundary information of the original space
ex. 1***
*→0: 1000 → (10, 00) Left-bottom corner
*→1: 1111 → (11, 11) Right-top corner
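The boundary reconstruction for 1*** can be written out as a short routine. This is a minimal sketch; the function names `deinterleave` and `prefix_bounds` are introduced here, and the x-bit-first interleaving order is taken from the earlier example (x=00, y=11 → 0101).

```python
def deinterleave(code):
    """Split an interleaved bit string (x bit first) into (x_bits, y_bits)."""
    return code[0::2], code[1::2]

def prefix_bounds(name):
    """Return (left-bottom corner, right-top corner) of a subspace name."""
    low = deinterleave(name.replace("*", "0"))   # fill wildcards with 0
    high = deinterleave(name.replace("*", "1"))  # fill wildcards with 1
    return low, high

low, high = prefix_bounds("1***")
print(low, high)
```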
Build an index with the longest common prefix of keys
Index: 000*, 001*, 01**, 1***
Buckets: allocate per subspace
[Figure: each index entry (000*, 001*, 01**, 1***) points to the bucket that holds the keys of its subspace.]
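Routing a key to its bucket through such an index can be sketched as a floor lookup over the sorted subspace names. The code assumes the subspaces tile the whole space without overlap; `find_bucket` and the sample keys are hypothetical, and a real deployment would keep the index in the key-value store itself.

```python
import bisect

# Subspace names from the slide, in key order.
index = ["000*", "001*", "01**", "1***"]
# Smallest key of each subspace: replace every '*' with '0'.
lower_bounds = [name.replace("*", "0") for name in index]

def find_bucket(key):
    """Return the subspace whose key range contains `key` (floor lookup)."""
    pos = bisect.bisect_right(lower_bounds, key) - 1
    return index[pos]

buckets = {name: [] for name in index}  # one bucket allocated per subspace
for key in ["0101", "0011", "1110", "0000"]:
    buckets[find_bucket(key)].append(key)
```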
Multi-dimensional Range Query
▐ Scan 0010–1001 on the index
▐ Index Subspace Pruning (Filter)
● Reconstruct the boundary info. & check whether the subspace intersects the queried area
▐ Scan the subspaces that pass the filter
[Figure: an index with entries 000*, 001*, 01**, 10**, 11** over the partitioned Z-ordering space.]
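The three steps can be put together in a short sketch. It assumes the queried area is x in [1, 2], y in [0, 1], whose corners linearize to 0010 and 1001, and uses the five index entries shown in the figure; all function names are introduced here for illustration.

```python
def interleave(x, y, bits=2):
    """Bitwise interleaving of x and y (x bit first), as a bit string."""
    xb, yb = format(x, f"0{bits}b"), format(y, f"0{bits}b")
    return "".join(a + b for a, b in zip(xb, yb))

def bounds(name):
    """Reconstruct (x_min, y_min, x_max, y_max) from a subspace name."""
    lo, hi = name.replace("*", "0"), name.replace("*", "1")
    return (int(lo[0::2], 2), int(lo[1::2], 2),
            int(hi[0::2], 2), int(hi[1::2], 2))

def range_query_buckets(index, x_lo, y_lo, x_hi, y_hi):
    z_lo, z_hi = interleave(x_lo, y_lo), interleave(x_hi, y_hi)
    # Step 1: index scan, keep subspaces whose key range overlaps [z_lo, z_hi].
    candidates = [n for n in index
                  if n.replace("*", "1") >= z_lo and n.replace("*", "0") <= z_hi]
    # Step 2: pruning, keep subspaces whose box intersects the queried area.
    selected = []
    for name in candidates:
        x_min, y_min, x_max, y_max = bounds(name)
        if x_min <= x_hi and x_lo <= x_max and y_min <= y_hi and y_lo <= y_max:
            selected.append(name)
    # Step 3 (not shown): scan only the buckets in `selected`.
    return candidates, selected

index = ["000*", "001*", "01**", "10**", "11**"]
candidates, selected = range_query_buckets(index, 1, 0, 2, 1)
```

Under these assumptions the index scan returns three candidate subspaces and the boundary check drops the one that lies entirely above the queried area, so its bucket is never read.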
K Nearest Neighbors Query
▐ The best-first algorithm can be applied.
● The most efficient technique in practical cases
▐ See our paper for the details
[Figure: regions of the space numbered 1 to 5.]
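A generic best-first kNN search over prefix-named subspaces can be sketched as follows. This is not the procedure from the paper, which the slide does not spell out; it is a textbook best-first search with a priority queue, and the bucket contents, the query point, and all function names are made up for illustration.

```python
import heapq

def bounds(name):
    """Reconstruct (x_min, y_min, x_max, y_max) from a subspace name."""
    lo, hi = name.replace("*", "0"), name.replace("*", "1")
    return (int(lo[0::2], 2), int(lo[1::2], 2),
            int(hi[0::2], 2), int(hi[1::2], 2))

def min_dist2(q, box):
    """Squared distance from q to the nearest point of an axis-aligned box."""
    x_min, y_min, x_max, y_max = box
    dx = max(x_min - q[0], 0, q[0] - x_max)
    dy = max(y_min - q[1], 0, q[1] - y_max)
    return dx * dx + dy * dy

def knn(buckets, q, k):
    """Best-first search: always expand whatever is closest to q next."""
    # Heap entries: (squared distance, kind, payload); kind 0 = subspace,
    # kind 1 = point. On equal distance a subspace is expanded first.
    heap = [(min_dist2(q, bounds(name)), 0, name) for name in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 1:
            result.append(item)  # no unexpanded subspace can hold a closer point
        else:
            for p in buckets[item]:
                d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d, 1, p))
    return result

# Made-up bucket contents; each point lies inside its subspace.
buckets = {
    "000*": [(0, 0)],
    "001*": [(1, 1)],
    "01**": [(0, 3), (1, 2)],
    "1***": [(2, 0), (3, 2), (2, 3)],
}
neighbors = knn(buckets, q=(3, 1), k=3)
```

In this run only two of the four buckets are read; the other two are never expanded because their boxes lie farther away than the third neighbor.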
Variations of Storage Layer
▐ Table Share Model
● Uses a single table, maintains bucket boundaries
● Most space efficient
● Bucket co-location may cause disk access congestion
▐ Table per Bucket Model
● Allocates a table per bucket
● Most flexible mapping
• One-to-one, one-to-many, many-to-one
● Bucket split is expensive
• Copy all points to the new buckets.
▐ Region per Bucket Model
● Allocates a region per bucket
● Most efficient bucket split
• Asynchronous bucket split
● Requires modification of HBase
Experimental Results: Multi-dimensional Range Query
● Dataset: 400,000,000 points
● Queries: select objects within MD ranges, varying the selectivity
● Cluster size: 4 nodes
MD-HBase responds 10 to 100 times faster than the others,
with response time proportional to selectivity.
[Figure: Response Time (Sec), log scale from 1 to 1000, versus Selectivity (%) from 0.01 to 10, for MD-HBase, HBase (ZOrder), and MapReduce.]
Experimental Results: k Nearest Neighbors Query
● Dataset: 400,000,000 points
● Queries: choose a point and vary the number of neighbors
● Cluster size: 4 nodes
MD-HBase responds in 1.5 sec when k ≦ 100,
and in 11 sec even when k = 10,000.
[Figure: Response Time (Sec), from 0 to 12, versus k: Number of Neighbors, from 1 to 10000.]
Experimental Results: Insert
● Dataset: spatially skewed data generated by a Zipfian distribution
MD-HBase shows good scalability without significant overhead.
[Figure: Throughput (records/sec), from 0 to 250,000, versus Number of nodes, from 0 to 20, for MD-HBase and HBase (ZOrder).]
Conclusions
▐ Designed a scalable multi-dimensional data store.
● Scalability & efficient multi-dimensional queries
● Key Idea: indexing the longest common prefix of keys
● Easily extends general ordered key-value stores.
▐ Demonstrated scalable insert throughput and excellent query performance.
● Range Query: 10 to 100 times faster than existing technologies.
● kNN Query: 1.5 s when k ≦ 100.
● Insert: 220K inserts/sec on a 16-node cluster without overhead
Thank you.
Any Questions?