A Relational Algebra Processor
6.375 Final Project
Ming Liu, Shuotao Xu

Motivation
  Today’s Database Management Systems (DBMS): software running on a standard operating system on a general purpose CPU
  DBMS frequently used in analytics and scientific computing, but bottlenecked by:
    Processor speed, software overhead, latency & bandwidth
  Proposal: FPGA Based Relational Algebra Processor
  [Diagram: Host PC (DBMS), FPGA Relational Algebra Processor, Physical Storage]

Background | Relational Algebra (RA)
  Many database queries are fundamentally decomposable to five basic RA operators
    Although SQL is capable of much more

  Operator            Function
  Selection           Filter rows based on a Boolean condition
  Projection          Eliminate selected attributes (columns) of a table; remove duplicated results
  Cartesian Product   Combine several tables with unique attributes
  Union               Combine several tables with the same attributes
  Difference          Select rows of several tables where the rows do not match

  Design dedicated processors on the FPGA for each operator

Project Goal
  Design and implement an in-memory relational algebra processor on the FPGA
  Explore the types of queries that can benefit from FPGA acceleration
  Secondary: Outperform SQLite!
  Some assumptions:
    32-bit wide table entries
    Tables fit in memory
    Max number of columns is 32
    Read only

Microarchitecture
  [Diagram: Host Software, FPGA]

Microarchitecture | Top-Level
  [Diagram: Host PC (DBMS) becomes Host PC (C++ functions); Physical Storage becomes DRAM; the host connects to the RA Processor over PCIe]

Microarchitecture | Row Marshaller
  Exposes a simple interface for operators to access tables in DRAM
  Address translation, burst aggregation, truncation & alignment
  Multiplexes requests
  Table values sent/received as 32-bit bursts

Microarchitecture | Selection
  Filters rows based on predicates (e.g. age < 40)
  16 predicate evaluators
    Internally comparators
  A tree of gates to qualify the predicates
    Max: 4 ORs of 4 ANDs

Microarchitecture | Projection
  Select columns of a table
  Column mask one-hot encoded
  Do not need to buffer row; operate directly on data bursts

Microarchitecture | Binary Operators
  Cartesian Product, Union, Difference and Deduplication
  Nested loop implementation

Microarchitecture | Inter-operator Bypassing
  Operators enabled concurrently; data passed between operators
    No intermediate storage
  Conditions:
    1. A singly linked chain of unary operators
    2. Each operator has a single target output
    3. No structural hazard
  Software reorders and schedules the RA commands
    Data source/destination encoded in command

Microarchitecture | Inter-operator Bypassing
  Multiple 32-bit wide output FIFOs to other operators

Implementation Evaluation
  Timing
    Maximum Frequency: 55.786 MHz
    Critical Path: Row Marshaller mux
  Area
    Slice Registers: 50%
    LUTs: 85%
    BRAM/FIFOs: 47%

  Module              Slice Registers   LUTs          BRAM/FIFOs
  TOTAL               34649 (50%)       59328 (85%)   71 (47%)
  Row Marshaller      2804              6627          0
  Controller          4570              6277          29
  Selection           3137              19633         0
  Projection          739               654           0
  Cartesian Product   1935              1478          0
  Union               1939              1983          0
  Difference          1875              1949          0
  Deduplication       1822              1970          0

Performance Benchmark | Setup
  SQLite
    Internal SQLite timer to report execution time of the query
    Thinkpad T430, Core i7-3520M @ 2.90 GHz, 1x8GB DDR3-1600
  RA Processor
    Performance counters: cycles from start to ack of an operator

  Table: 1 table, 100k x 30
    Relational Algebra Query:
      SELECT,starLong,tableOut, mass,>,80000,AND,pos_x,>,10, OR,pos_x,<,pos_z, OR,col12,>,col14, AND,col20,<,col21
    SQL Query:
      SELECT * FROM starLong WHERE mass > 80000 AND pos_x > 10 OR pos_x < pos_z OR col12 > col14 AND col20 < col21;

  Table: 1 table, 100k x 30
    Relational Algebra Query:
      PROJECT,starLong,tableOut, pos_x,col19,col25,col29
    SQL Query:
      SELECT pos_x,col19, col25, col29 FROM starLong;

  Table: 2 tables, 1k x 30
    Relational Algebra Query:
      UNION,starMed1,starMed2,starUnion
    SQL Query:
      SELECT * FROM starMed1 UNION SELECT * FROM starMed2;

  Table: 2 tables, 1k x 30
    Relational Algebra Query:
      XPROD,starMed1,starMed2,starXprod
      RENAME,starXprod,0,iOrder0,1,mass0,8,phi0
      SELECT,starXprod,starFiltered, iOrder0,=,iOrder, AND,phi0,>,1, AND,mass0,>,mass
      PROJECT,starFiltered,starOut,mass0
    SQL Query:
      SELECT s1.mass FROM starMed1 s1, starMed2 s2 WHERE s1.vx > s2.vx AND s1.phi > 1 AND s1.mass > s2.mass;

Performance Benchmark | Results
  [Figure: Query Execution Time. Y-axis: time (s), lower is better. X-axis: query (Select, Project, Union, Difference, Xprod, Dedup, Complex Join). Series: FPGA RA Processor, SW SQLite.]
  Limitation: Memory Bandwidth: 200 MB/s (FPGA) vs 12.8 GB/s (host PC)

Performance Benchmark | Results
  Select operator most competitive with SQLite
  What happens with more predicates?
  [Figure: Select (Filter) Execution Time with Varying Number of Predicates. Y-axis: time (s), lower is better. X-axis: number of predicates (1, 2, 4, 8, 16). Series: FPGA RA Processor, SW SQLite.]

Improvements
  Increasing data burst width
    Maximizing memory bandwidth
    32-bit to 256-bit: potential 8x speedup
    Area/critical path increase
  Additional row buffers to buffer data from DDR2 memory
  Larger, faster DRAM; higher clock speed

Conclusion & Future Work
  Complex filtering operations perform well on the FPGA
    Better than SQLite with sufficient memory bandwidth
  Data intensive operators do not perform well
  Future opportunities:
    An accelerator alongside SQLite
    Integration with HDD/SSD controller
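
Bandwidth sanity check (sketch)
  The memory bandwidth limitation and the proposed burst-width increase can be checked with simple arithmetic. The Python sketch below is only a back-of-envelope model, not part of the project: it assumes a full table scan is purely bandwidth bound, that the Row Marshaller could deliver at most one burst per clock cycle, and that 1 MB = 10^6 bytes. The function names are hypothetical; the table size, bandwidths, clock frequency and burst widths are the figures quoted in the slides.

```python
# Back-of-envelope model of a bandwidth-bound table scan.
# Assumptions (not from the slides): one burst per clock cycle at best,
# 1 MB = 10**6 bytes, and no overhead other than moving the data.

BYTES_PER_ENTRY = 4  # 32-bit wide table entries


def table_bytes(rows: int, cols: int) -> int:
    """Size of a table in bytes, given 32-bit entries."""
    return rows * cols * BYTES_PER_ENTRY


def scan_time_s(rows: int, cols: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to stream the whole table once."""
    return table_bytes(rows, cols) / bandwidth_bytes_per_s


def peak_bandwidth(burst_bits: int, clock_hz: float) -> float:
    """Peak bytes per second if one burst is transferred every cycle."""
    return (burst_bits // 8) * clock_hz


if __name__ == "__main__":
    rows, cols = 100_000, 30          # benchmark table: 100k x 30
    fpga_bw, host_bw = 200e6, 12.8e9  # 200 MB/s vs 12.8 GB/s
    clock_hz = 55.786e6               # reported maximum frequency

    print(f"table size:       {table_bytes(rows, cols)} bytes")
    print(f"scan at 200 MB/s: {scan_time_s(rows, cols, fpga_bw):.4f} s")
    print(f"scan at 12.8 GB/s: {scan_time_s(rows, cols, host_bw):.6f} s")
    print(f"32-bit peak:  {peak_bandwidth(32, clock_hz) / 1e6:.1f} MB/s")
    print(f"256-bit peak: {peak_bandwidth(256, clock_hz) / 1e6:.1f} MB/s")
```

  Under these assumptions the 100k x 30 table is 12 MB, so one pass at 200 MB/s takes at least 0.06 s. A 32-bit burst per cycle at 55.786 MHz peaks at roughly 223 MB/s, which is in line with the quoted 200 MB/s, and widening the burst to 256 bits at the same clock raises that ceiling by the 8x factor named under Improvements.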