MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
S. Nishimura (NEC Service Platforms Labs.), S. Das, D. Agrawal, A. Abbadi (University of California, Santa Barbara)
Presenter: Zhuo Liu

Overview
▐ A Motivating Story
▐ Existing Technologies
▐ Our proposal
▐ Evaluation
▐ Conclusion

Motivating Scenario: Mobile Coupon Distribution
[Figure labels: Mobile Coupon Distributor; Current Location; Coupon]
▐ Distribution Policy
  • Area
  • # of coupons

Motivating Scenario: Mobile Coupon Distribution
▐ System Scalability
  • Large amounts of Data
  • High Throughput
▐ Efficient Complex Queries
  • Multi-Dimensional Query
  • Nearest Neighbors Query
▐ 125,000,000 subscribers in Japan
[Figure labels: Current Location; Coupon]
▐ Distribution Policy
  • Area
  • # of coupons

Existing Technologies
[Figure: technologies compared on two axes, Multidimensional Queries and Scalability. Labels: Commercial products; Relational DBs; Spatial DBs; but expensive; Open source products; Key-Value Stores; What We Want, at a reasonable price]

Ordered Key-Value Stores
▐ ex. BigTable, HBase
▐ Buckets are sorted by key
  • Bucket: key00 value00, key01 value01, …, key0X value0X
  • Bucket: key11 value11, key12 value12, …, key1Y value1Y
  • Bucket: keynn valuenn
▐ Index: key00, key11, …, keynn
▐ Good at 1-D Range Query
▐ But, our target is multi-dimensional… (Latitude, Longitude, Time)

Naïve Solution: Linearization
▐ Projects n-D space to 1-D space
▐ Apply a Z-ordering curve…

     5  7 13 15
     4  6 12 14
     1  3  9 11
     0  2  8 10

▐ The linearized values are then stored as keys in the ordered key-value store (key00 value00, …, keynn valuenn).
▐ Simple, but problematic…

Problem: False positive scans
▐ MD-query on Linearized space
  • Translate a MD-query to a linearized range query. Ex. Query from 2 to 9.
  • Scan the queried linearized range.
  • Filter out points outside the queried area. Ex. blue-hatched area (4 to 7).
  • Requires the boundary information of the original space.

     5  7 13 15
     4  6 12 14
     1  3  9 11
     0  2  8 10

Our Approach: MD-HBase
▐ Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store
  • MD-HBase: Multi-Dimensional Index
  • Ordered Key-Value Store (ex. BigTable, HBase, …): Single Dimensional Index

Introduce Multi-dimensional Index
▐ Multi-dimensional Index (ex. the K-d tree, the Quad tree)
  • Divide a space into subspaces containing almost the same # of points
  • Organize subspaces as a tree
  • Efficient subspace pruning → to avoid false positive scans

Space Partition By the K-d tree
▐ Binary Z-ordering space: bitwise interleaving, ex. x=00, y=11 → 0101

     y \ x    00    01    10    11
     11      0101  0111  1101  1111
     10      0100  0110  1100  1110
     01      0001  0011  1001  1011
     00      0000  0010  1000  1010

▐ The same space is partitioned by the K-d tree.
▐ How do we represent these subspaces?

Key Idea: The longest common prefix naming scheme
▐ Subspaces are represented as the longest common prefix of keys!
  • ex. 000*, 1***
▐ Remarkable Property
  • Preserves boundary information of the original space
  • ex. 1***: * → 0 gives 1000 = (10, 00), the left-bottom corner; * → 1 gives 1111 = (11, 11), the right-top corner

Build an index with the longest common prefix of keys
▐ Index: 000*, 001*, 01**, 1***
▐ Buckets: allocate per subspace

Multi-dimensional Range Query
▐ Scan 0010 - 1001 on the index
  • Index entries shown: 000*, 001*, 01**, 10**, 11**
▐ Subspace Pruning (Filter)
  • Reconstruct the boundary info. and check whether each subspace intersects the queried area
▐ Scan the buckets of the remaining subspaces

K Nearest Neighbors Query
▐ The best first algorithm can be applied.
  • The most efficient technique in practical cases
▐ Check the details in our paper

Variations of Storage Layer
▐ Table Share Model
  • Uses a single table and maintains bucket boundaries
  • Most space efficient
  • Bucket co-location may cause disk access congestion
▐ Table per Bucket Model
  • Allocates a table per bucket
  • Most flexible mapping: one-to-one, one-to-many, many-to-one
  • Bucket split is expensive: copies all points to the new buckets
▐ Region per Bucket Model
  • Allocates a region per bucket
  • Most efficient bucket split: asynchronous bucket split
  • Requires modification of HBase

Experimental Results: Multi-dimensional Range Query
▐ Dataset: 400,000,000 points
▐ Queries: select objects within MD ranges and change selectivity
▐ Cluster size: 4 nodes
▐ MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.
[Figure: Response Time (Sec), 1 to 1000, vs. Selectivity (%), 0.01 to 10. Series: MD-HBase, HBase (ZOrder), MapReduce]

Experimental Results: k Nearest Neighbors Query
▐ Dataset: 400,000,000 points
▐ Queries: choose a point and change the number of neighbors
▐ Cluster size: 4 nodes
▐ MD-HBase responds in 1.5 sec where k ≦ 100, and in 11 sec even if k = 10,000
[Figure: Response Time (Sec), 0 to 12, vs. k: Number of Neighbors, 1 to 10000]

Experimental Results: Insert
▐ Dataset: spatially skewed data generated by a zipfian distribution
▐ MD-HBase shows good scalability without significant overhead.
[Figure: Throughput (records/sec), 0 to 250,000, vs. Number of nodes, 0 to 20. Series: MD-HBase, HBase (ZOrder)]

Conclusions
▐ Designed a scalable multi-dimensional data store.
  • Scalability & Efficient multi-dimensional queries
▐ Key Idea: indexing the longest common prefix of keys
  • Easily extends general ordered key-value stores.
▐ Demonstrated scalable insert throughput and excellent query performance.
  • Range Query: 10-100 times faster than existing technologies.
  • kNN Query: 1.5 s when k ≦ 100.
  • Insert: 220K inserts/sec on a 16-node cluster without overhead

Thank you. Any Questions?
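
Backup: a minimal sketch of the key idea, assuming a 2-D space with 2 bits per dimension as on the slides. It shows bitwise interleaving, how a longest-common-prefix name such as 1*** gives back the corners of its subspace, and how that boundary information allows subspace pruning before any bucket is scanned. The function names (interleave, bounds, prune) and the in-memory list used as the index are illustrative assumptions, not MD-HBase's actual API.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def interleave(x: int, y: int, bits: int) -> str:
    """Z-order key as a bit string: x bit, then y bit, most significant first."""
    out = []
    for i in reversed(range(bits)):
        out.append(str((x >> i) & 1))
        out.append(str((y >> i) & 1))
    return "".join(out)


def deinterleave(key: str) -> Point:
    """Inverse of interleave: even positions are x bits, odd positions are y bits."""
    return int(key[0::2], 2), int(key[1::2], 2)


def bounds(prefix: str) -> Tuple[Point, Point]:
    """Corners of a subspace named by a prefix such as '1***'.

    Replacing every * with 0 gives the left-bottom corner,
    replacing every * with 1 gives the right-top corner.
    """
    low = deinterleave(prefix.replace("*", "0"))
    high = deinterleave(prefix.replace("*", "1"))
    return low, high


def intersects(prefix: str, q_low: Point, q_high: Point) -> bool:
    """True if the subspace rectangle overlaps the query rectangle (inclusive)."""
    (lx, ly), (hx, hy) = bounds(prefix)
    return lx <= q_high[0] and q_low[0] <= hx and ly <= q_high[1] and q_low[1] <= hy


def prune(index: List[str], q_low: Point, q_high: Point) -> List[str]:
    """Keep only the index entries whose subspace intersects the query."""
    return [p for p in index if intersects(p, q_low, q_high)]


# Hypothetical in-memory stand-in for the index table.
index = ["000*", "001*", "01**", "1***"]

# Query box with corners (01, 00) and (10, 01); it linearizes to 0010 .. 1001.
q_low, q_high = (0b01, 0b00), (0b10, 0b01)
survivors = prune(index, q_low, q_high)  # ['001*', '1***']
```

In this sketch the subspace 01** (linearized values 4 to 7) lies inside the linearized range 0010 to 1001 but is dropped by prune, because its reconstructed corners do not intersect the query box. That is the false positive scan a plain Z-order range scan would have paid for.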