Indexing HDFS Data in PDW: Splitting the data from index Vinitha Gankidi#, Nikhil Teletia*, Jignesh M. Patel#, Alan Halverson*, David J. DeWitt* University of Wisconsin-Madison * Microsoft Jim Gray Systems Lab # 1 Motivation SQL SQL Server PDW with Polybase Result Data lives in two worlds Hybrid SQL-On-Hadoop solutions (Microsoft PolyBase, Teradata QueryGrid, IBM Big SQL etc.) Hot Data Cold Data RDBMS HDFS Familiar SQL interface Load first, schema later Decades of research and optimization Cheap and scalable data store 2 Query Execution over External Data SELECT * FROM hdfs_Employee The HDFS files haveWHERE to be DeptID entirely= 1imported 1 Import HDFS files into PDW IMPORT PATH 2 Run the rest of the query inside PDW IDID DeptID DeptID ID Name DeptID 101 A 1 101 101 PUSH-DOWN PATH Name Name 102 102 AA BB 11 22 102for MAP B 2 Significant startup overhead task 103 C 3 1 Run a Map job to filter 103 103 C C 3 3 intofiles PDWare scanned entirely 2 Import the result of the AllMap thejob HDFS HDFS 3 Run the rest of the query inside PDW 3 What is a Split-Index? 1. Index is stored in RDBMS, while the data is in HDFS 2. Index is stored as a RDBMS table • Hash-partitioned across multiple node • Each partition has clustered B+ tree Index RDBMS IDID ID 101 101 101 102 102 Dept ID 1 Name Name Name AA A BB 102viewB Split-Index is similar to a materialized 103 103 CC HDFS an external HDFS REC 103 C (with pointer) File offset Len Split-Index can be out-of-sync with the data Name HDFS file1 0 10 DeptID DeptID DeptID 11 1 22 2 33 3 4 Query Execution using Split-Index SELECT name FROM hdfs_Employee WHERE DeptID = 1 DeptID DeptID File offset offset Len Len File HDFS Name HDFS REC Name 4 1 Name A Return the result … 2 IDID Name Name DeptID DeptID SELECT [HDFS File Name], ID Name DeptID [HDFS Offset], [Rec Len] 101 101 AA 11 File offset Len FROM index_Employee 1 file1 0 10 Name 101 ABBhaving 122 to Using index, we can answer queries without 102 1 file1 0 10 102 WHERE DeptID = 1 1 22 file1 01010 10 102 BC 23 file1 10 sequentially scan each HDFS file. file1 10 103 103 C 3 Qualifying Tuples 2 3 file1 1020 1010 103 C 3 file1 Index_Emp (Index on DeptID)10 3 file1 20 HDFS HDFS REC 3 file1 20 10 HDFS File offset Len 3 Name RDBMS Retrieve qualifying tuples file1 0 10 from HDFS files. file2 100 10 5 Dept ID Incremental Index Update • Given the append-only property of the HDFS data, index can be updated incrementally • A new HDFS file is added • Append the rows of the new file to the existing index • An HDFS file is deleted • Delete the rows of the deleted file from the existing index 6 Hybrid Scan • A stale Split-Index can still be used during query execution • Examples: • An HDFS file is added • Scan the new file using non-index approach • Process existing files using index • An HDFS file is deleted • When probing the index, remove the rows associated with the deleted file 7 Experiments 8 Split-Index Performance • Cluster • 9 Node SQL Server PDW cluster (8 compute nodes + 1 control node) • 29 Node Windows HDP 2.0 cluster (28 data nodes + 1 name node) • Data Set • 10 TB Scale Lineitem table • Compare Push-Down approach with Split-Index approach COST Map Cost RID Materialize Cost Data Import Cost Data Import Cost Push-Down Approach Split-Index Approach 9 Split-Index Performance SELECT * FROM lineitem WHERE l_orderkey <= [Variable] RID Materialize Cost Data Import Cost Map Cost Split-Index on l_orderkey Execution Time (in seconds) 20000 16209 15000 Data Size: ~800GB Index Size: ~80GB 12248 11694 10000 6563 5000 0 4 617 Map Cost RID Materialize Cost Data Import Cost Data Import Cost Index performance is sensitive to the access pattern. Push-down Split-index Push-down Split-index Push-down Split-index Rifle-Shot (1 tuple) 1% (600M tuples) Predicate Selectivity 10% (6B tuples) Push-Down Approach Split-Index Approach 10 Space vs. Time Trade off • Cost of storing the data in RDBMS is higher compared to HDFS • Split-Index can SELECT be used as a covering index SUM(l_extendedprice*l_discount) ASwe REVENUE • Quantify the performance and space trade-off as move FROMto lineitem columns from HDFS PDW WHERE l_shipdate >= '1994-01-01' • Experiment Setup AND l_shipdate < • 1 TB Scale Lineitem dateadd(mm, 1, cast('1994-01-01' as date)) • Modified Query 6 AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01 AND l_quantity < 24 11 Space vs. Time Trade off PDW Disk footprint (GB) Split-Index can be used to balance the query The Lineitem 500table is in PDW. No index. execution time and the PDW disk footprint 400 The Lineitem table is in HDFS Split-Index on l_shipdate, l_discount, l_quantity, l_extendedprice 300 200 The Lineitem table is in HDFS. No index. (Push-Down) The Lineitem table is in HDFS Split-Index on l_shipdate 100 0 0 100 200 300 400 500 Execution time (in seconds) 600 700 12 Conclusions and Future Work A simple “Split-Index” mechanism can be used to achieve low-latency on highly-selective queries, with minimal system changes Incremental index update reduces the cost of maintaining the Split-Index; Hybrid scan allows using the stale Split-Index Future Work: Query optimization to use the Split-Index, and automatic physical schema designer for the Split Index(es). 13