Final Presentation - Computation Structures Group

A Relational Algebra Processor
6.375 Final Project
Ming Liu, Shuotao Xu
Motivation
Today’s Database Management Systems (DBMS): software running on a standard operating system on a general-purpose CPU
DBMS frequently used in analytics and scientific computing, but bottlenecked by:
  Processor speed, software overhead, latency & bandwidth
Proposal: FPGA Based Relational Algebra Processor
[Diagram: Host PC (DBMS), FPGA Relational Algebra Processor, Physical Storage]
Background | Relational Algebra (RA)
Many database queries are fundamentally decomposable to five basic RA operators
  Although SQL is capable of much more
Operator            Function
Selection           Filter rows based on a Boolean condition
Projection          Eliminate selected attributes (columns) of a table; remove duplicated results
Cartesian Product   Combine several tables with unique attributes
Union               Combine several tables with the same attributes
Difference          Select rows of several tables where the rows do not match
Design dedicated processors on the FPGA for each operator
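The five operators above can be illustrated with a minimal software sketch. It assumes a table is a Python set of equal-length tuples, which is a modelling choice for this example and not the processor's representation; the function names are hypothetical.

```python
# Minimal sketch of the five basic RA operators.
# Assumption: a table is a set of tuples (one tuple per row).

def selection(rows, predicate):
    """Keep rows for which the Boolean condition holds."""
    return {r for r in rows if predicate(r)}

def projection(rows, column_indices):
    """Keep only the listed columns; the set removes duplicated results."""
    return {tuple(r[i] for i in column_indices) for r in rows}

def cartesian_product(a, b):
    """Pair every row of a with every row of b."""
    return {ra + rb for ra in a for rb in b}

def union(a, b):
    """Rows that appear in either table (same attributes assumed)."""
    return a | b

def difference(a, b):
    """Rows of a that do not appear in b."""
    return a - b

people = {(1, 35), (2, 52), (3, 35)}          # (id, age)
young = selection(people, lambda r: r[1] < 40)
ages = projection(people, [1])
```

Here `young` holds the two rows with age 35, and `ages` holds two single-column rows because the duplicate age collapses.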
Project Goal
Design and implement an in-memory relational algebra processor on the FPGA
Explore the types of queries that can benefit from FPGA acceleration
Secondary: Outperform SQLite!
Some assumptions:
  32-bit wide table entries
  Tables fit in memory
  Max number of columns is 32
  Read only
Microarchitecture | Host Software
FPGA
Microarchitecture | Top-Level RAProcessor
[Diagram labels: Host PC (DBMS), RA Processor, Physical Storage; Host PC (C++ functions), PCIe, RA Processor, DRAM]
Microarchitecture | Row Marshaller
Exposes a simple interface for operators to access tables in DRAM
Address translation, burst aggregation, truncation & alignment
Multiplexes requests
Table values sent/received as 32-bit bursts
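The address translation step can be sketched as follows. The slides do not describe the memory layout, so this assumes a dense row-major table of 32-bit entries starting at a base address; `row_address` and `row_burst_addresses` are hypothetical names.

```python
BURST_BYTES = 4  # one 32-bit table entry per burst, per the slides

def row_address(table_base, num_cols, row_index):
    """Byte address of the first entry of a row.

    Assumption: rows are stored back to back (row-major, no padding).
    """
    return table_base + row_index * num_cols * BURST_BYTES

def row_burst_addresses(table_base, num_cols, row_index):
    """Byte addresses of the 32-bit bursts that make up one row."""
    start = row_address(table_base, num_cols, row_index)
    return [start + i * BURST_BYTES for i in range(num_cols)]
```

For a 30-column table, consecutive rows are 120 bytes apart under this layout.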




Microarchitecture | Selection


Filters rows based on predicates (e.g. age < 40)
16 predicate evaluators
Implemented internally as comparators
A tree of gates to qualify the predicates
Max: 4 ORs of 4 ANDs
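A software analogue of the predicate tree, as a sketch only: each predicate is a comparison, predicates are ANDed within a group, and the groups are ORed. The limits of 4 groups and 4 predicates per group come from the slide; the tuple encoding `(column, op, rhs)` and the function names are assumptions made for this example.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}
MAX_OR_GROUPS = 4   # "4 ORs"
MAX_ANDS = 4        # "of 4 ANDs"

def eval_predicate(row, predicate):
    """Compare a column against a constant or against another column."""
    col, op, rhs = predicate
    rhs_value = row[rhs] if isinstance(rhs, str) else rhs
    return OPS[op](row[col], rhs_value)

def select_row(row, or_groups):
    """True if any AND-group has all of its predicates satisfied."""
    if len(or_groups) > MAX_OR_GROUPS or any(len(g) > MAX_ANDS for g in or_groups):
        raise ValueError("at most 4 OR groups of 4 ANDed predicates")
    return any(all(eval_predicate(row, p) for p in group) for group in or_groups)

row = {"age": 35, "mass": 10, "pos_x": 5, "pos_z": 3}
keep = select_row(row, [[("age", "<", 40), ("mass", ">", 50)],
                        [("pos_x", ">", "pos_z")]])
```

The first group fails on `mass > 50`, the second group passes, so the row is kept.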
Microarchitecture | Projection



Select columns of a table
Column mask one-hot encoded
Do not need to buffer row; operate directly on data bursts
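Operating directly on data bursts can be sketched as a streaming filter: each incoming 32-bit word is kept or dropped according to its column position, so no row is ever buffered. The mask encoding used here (bit `i` set means column `i` is kept) is an assumption for illustration, as is the function name.

```python
def project_bursts(bursts, num_cols, column_mask):
    """Yield only the words whose column bit is set in column_mask.

    bursts is a flat stream of 32-bit words in row-major order.
    """
    for i, word in enumerate(bursts):
        col = i % num_cols            # column position within the current row
        if (column_mask >> col) & 1:
            yield word

# Two rows of three columns; keep columns 0 and 2.
kept = list(project_bursts([10, 11, 12, 20, 21, 22], 3, 0b101))
```

This sketch covers only column selection; removing duplicated rows would be a separate step.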
Microarchitecture | Binary Operators


Cartesian Product, Union, Difference and Deduplication
Nested loop implementation
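A nested loop implementation compares each row of one table against rows of the other. The sketch below shows this for Difference and counts row comparisons to make the cost visible; it is a software illustration with hypothetical names, not the hardware design.

```python
def difference_nested(a, b):
    """Rows of a with no matching row in b, via a nested loop.

    Returns (result_rows, number_of_row_comparisons).
    """
    out, comparisons = [], 0
    for row_a in a:                 # outer loop: each row of the first table
        matched = False
        for row_b in b:             # inner loop: scan the second table
            comparisons += 1
            if row_a == row_b:
                matched = True
                break
        if not matched:
            out.append(row_a)
    return out, comparisons

rows, count = difference_nested([(1,), (2,), (3,)], [(2,), (4,)])
```

In the worst case, two disjoint tables of 1,000 rows each require 1,000,000 row comparisons, which is why these operators are data intensive.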
Microarchitecture | Inter-operator Bypassing

Operators enabled concurrently; data passed between operators
  No intermediate storage
Conditions:
  1. A single chain of unary operators
  2. Each operator has a single target output
  3. No structural hazard
Software reorders and schedules the RA commands
  Data source/destination encoded in command
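The idea of passing rows between concurrently enabled operators, without writing an intermediate table, resembles a chain of generators in software. This is an analogy under that assumption, with hypothetical function names; it does not model the FIFOs or the command encoding.

```python
def scan(table):
    """Source: stream rows out of a table."""
    for row in table:
        yield row

def select(rows, predicate):
    """Unary operator: forward only rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, column_indices):
    """Unary operator: forward only the chosen columns of each row."""
    for row in rows:
        yield tuple(row[i] for i in column_indices)

table = [(1, 35, 7), (2, 52, 9), (3, 28, 4)]    # (id, age, score)

# A single chain of unary operators, each with one target output.
pipeline = project(select(scan(table), lambda r: r[1] < 40), [0, 2])
result = list(pipeline)
```

Each row flows through select and project as soon as it is read, so the filtered table is never materialized.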
Microarchitecture | Inter-operator Bypassing

Multiple 32-bit wide output FIFOs to other operators
Implementation Evaluation


Timing
  Maximum Frequency: 55.786 MHz
  Critical Path: Row Marshaller mux
Area
  Slice Registers: 50%
  LUTs: 85%
  BRAM/FIFOs: 47%
Module              Slice Registers   LUTs          BRAM/FIFOs
TOTAL               34649 (50%)       59328 (85%)   71 (47%)
Row Marshaller      2804              6627          0
Controller          4570              6277          29
Selection           3137              19633         0
Projection          739               654           0
Cartesian Product   1935              1478          0
Union               1939              1983          0
Difference          1875              1949          0
Deduplication       1822              1970          0
Performance Benchmark | Setup

SQLite
  Internal SQLite timer to report execution time of the query
  Thinkpad T430, Core i7-3520M @ 2.90 GHz, 1x8GB DDR3-1600
RA Processor
  Performance counters: cycles from start to ack of an operator
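Comparing a cycle count against SQLite's wall-clock time requires a clock frequency. The sketch below assumes the design runs at its reported maximum frequency of 55.786 MHz; the slides do not state the actual operating clock, so treat the constant and the function name as assumptions.

```python
ASSUMED_CLOCK_HZ = 55.786e6  # reported maximum frequency, assumed to be the operating clock

def cycles_to_seconds(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a start-to-ack cycle count into seconds."""
    return cycles / clock_hz

one_second = cycles_to_seconds(55_786_000)
```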
Tables: 1 table, 100k x 30
Relational Algebra Query:
  SELECT,starLong,tableOut,
  mass,>,80000,AND,pos_x,>,10,
  OR,pos_x,<,pos_z,
  OR,col12,>,col14, AND,col20,<,col21
SQL Query:
  SELECT * FROM starLong
  WHERE mass > 80000 AND pos_x > 10
  OR pos_x < pos_z
  OR col12 > col14 AND col20 < col21;

Tables: 1 table, 100k x 30
Relational Algebra Query:
  PROJECT,starLong,tableOut,
  pos_x,col19,col25,col29
SQL Query:
  SELECT pos_x,col19, col25, col29 FROM starLong;

Tables: 2 tables, 1k x 30
Relational Algebra Query:
  UNION,starMed1,starMed2,starUnion
SQL Query:
  SELECT * FROM starMed1
  UNION
  SELECT * FROM starMed2;

Tables: 2 tables, 1k x 30
Relational Algebra Query:
  XPROD,starMed1,starMed2,starXprod
  RENAME,starXprod,0,iOrder0,1,mass0,8,phi0
  SELECT,starXprod,starFiltered, iOrder0,=,iOrder,
  AND,phi0,>,1,
  AND,mass0,>,mass
  PROJECT,starFiltered,starOut,mass0
SQL Query:
  SELECT s1.mass FROM starMed1 s1, starMed2 s2
  WHERE s1.vx > s2.vx
  AND s1.phi > 1
  AND s1.mass > s2.mass;
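The SELECT command strings suggest a simple comma-separated format: source table, destination table, then predicates joined by AND and OR. The parser below is a sketch whose grammar is inferred only from the examples on this slide (it is not the authors' host software), with AND binding tighter than OR as in the SQL version.

```python
def parse_select(command):
    """Split a SELECT command into (source, destination, or_groups).

    or_groups is a list of AND-groups; each predicate is (column, op, rhs).
    Assumed grammar: SELECT,src,dst,col,op,rhs[,AND|OR,col,op,rhs]...
    """
    tokens = [t.strip() for t in command.split(",") if t.strip()]
    if not tokens or tokens[0] != "SELECT":
        raise ValueError("not a SELECT command")
    src, dst = tokens[1], tokens[2]
    groups = [[]]
    i = 3
    while i < len(tokens):
        if tokens[i] == "OR":        # start a new AND-group
            groups.append([])
            i += 1
        elif tokens[i] == "AND":     # stay in the current AND-group
            i += 1
        else:
            col, op, rhs = tokens[i:i + 3]
            rhs = int(rhs) if rhs.lstrip("-").isdigit() else rhs
            groups[-1].append((col, op, rhs))
            i += 3
    return src, dst, groups

cmd = ("SELECT,starLong,tableOut,mass,>,80000,AND,pos_x,>,10,"
       "OR,pos_x,<,pos_z,OR,col12,>,col14, AND,col20,<,col21")
src, dst, groups = parse_select(cmd)
```

For the first benchmark query this yields three OR groups with 2, 1 and 2 predicates, within the 4 ORs of 4 ANDs limit of the Selection operator.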
Performance Benchmark | Results
[Figure: Query Execution Time. Y-axis: Time (s), lower is better, 0 to 1.8. X-axis: Query (Select, Project, Union, Difference, Xprod, Dedup, Complex Join). Series: FPGA RA Processor, SW SQLite.]
Limitation: memory bandwidth, 200 MB/s (FPGA) vs 12.8 GB/s (host DDR3-1600)
Performance Benchmark | Results

Select operator most competitive with SQLite

What happens with more predicates?
[Figure: Select (Filter) Execution Time with Varying Number of Predicates. Y-axis: Time (s), lower is better, 0 to 0.12. X-axis: Number of Predicates (1, 2, 4, 8, 16). Series: FPGA RA Processor, SW SQLite.]
Improvements

Increasing data burst width
  32-bit to 256-bit: potential 8x speedup
  Area/critical path increase
Maximizing memory bandwidth
  Additional row buffers to buffer data from DDR2 Memory
  Larger, faster DRAM; Higher clock speed
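The 8x figure follows from simple arithmetic on burst width. The sketch below is a back-of-the-envelope estimate that assumes one burst is transferred per clock cycle at the reported 55.786 MHz; neither assumption is stated in the slides, and the function names are hypothetical.

```python
ASSUMED_CLOCK_HZ = 55.786e6  # reported maximum frequency

def peak_bandwidth_mb_s(width_bits, clock_hz=ASSUMED_CLOCK_HZ):
    """Peak bandwidth in MB/s if one burst moves per cycle (assumption)."""
    return width_bits / 8 * clock_hz / 1e6

def scan_time_s(rows, cols, bandwidth_mb_s, entry_bytes=4):
    """Time to stream a whole table of 32-bit entries at a given bandwidth."""
    return rows * cols * entry_bytes / (bandwidth_mb_s * 1e6)

narrow = peak_bandwidth_mb_s(32)     # about 223 MB/s
wide = peak_bandwidth_mb_s(256)      # about 1785 MB/s, 8x the 32-bit figure
full_scan = scan_time_s(100_000, 30, 200.0)   # 12 MB table at 200 MB/s
```

Under these assumptions the 32-bit estimate is in the same range as the 200 MB/s quoted in the results, and streaming the 100k x 30 table at 200 MB/s takes about 0.06 s.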
Conclusion & Future Work

Complex filtering operations perform well on the FPGA
  Better than SQLite with sufficient memory bandwidth
Data intensive operators do not perform well
Future opportunities:
  An accelerator alongside SQLite
  Integration with HDD/SSD controller