Toward GPUs being mainstream in analytic processing
An initial argument using simple scan-aggregate queries

Jason Power || Yinan Li || Mark D. Hill
Jignesh M. Patel || David A. Wood
<powerjg@cs.wisc.edu>
DaMoN 2015, 6/1/2015
University of Wisconsin
Summary
▪ GPUs are energy efficient
  ▪ Discrete GPUs unpopular for DBMS
  ▪ New integrated GPUs solve the problems
▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload
▪ Up to 70% energy savings over multicore CPU
  ▪ Even more in the future
Analytic Data is Growing
▪ Data is growing rapidly
▪ Analytic DBs increasingly important
Source: IDC's Digital Universe Study, 2012.
Want: High performance
Need: Low energy
GPUs to the Rescue?
▪ GPUs are becoming more general
▪ Easier to program
▪ Integrated GPUs are everywhere
▪ GPUs show great promise [Govindaraju '04, He '14, He '14, Kaldewey '12, Satish '10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency
▪ Analytic DBs look like GPU workloads
GPU Microarchitecture
[Diagram: a Graphics Processing Unit contains multiple Compute Units (CUs) sharing an L2 cache; each Compute Unit has an instruction fetch/scheduler, an array of SIMD lanes (SPs), a register file, an L1 cache, and a scratchpad.]
Discrete GPUs
[Diagram, built up in four steps ➊➋➌➍: the CPU chip (cores with their own memory bus) connects to the discrete GPU (with its own memory bus) only through the PCIe bus. ➊ Data is copied across PCIe, ➋ the GPU computes out of its local memory, ➌ results are copied back, ➍ and this repeats.]
Discrete GPUs
▪ Copy data over PCIe ➊
  ▪ Low bandwidth
  ▪ High latency
▪ Small working memory
▪ High latency user→kernel calls
▪ Repeated many times ➋➌➍
98% of time spent not computing
Integrated GPUs
[Diagram: a single heterogeneous chip with CPU cores and GPU CUs sharing one memory bus.]
Heterogeneous System Arch.
▪ API for tightly-integrated accelerators
▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA Foundation (AMD, ARM, Qualcomm, others)
▪ No need for data copies
  ▪ Cache coherence and shared address space
▪ No OS kernel interaction
  ▪ User-mode queues
Outline
▪ Background
▪ Algorithms
  ▪ Scan
  ▪ Aggregate
▪ Results
Analytic DBs
▪ Resident in main memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel '13 & '14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scan by using sub-word parallelism
▪ Similar to industry proposals [SAP HANA, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries
Running Example

Shirt Color    Shirt Amount
Green  (2)     1
Green  (2)     3
Blue   (1)     1
Green  (2)     5
Yellow (3)     7
Red    (0)     2
Yellow (3)     1
Blue   (1)     4
Yellow (3)     2

Dictionary:  Color   Code
             Red     0
             Blue    1
             Green   2
             Yellow  3
Running Example
Count the number of green shirts in the inventory:
➊ Scan the color column for green (2)
➋ Aggregate Amount where there is a match
Traditional Scan Algorithm
Column Data (Shirt Color): 2 (10), 2 (10), 1 (01), 2 (10), 3 (11), 0 (00), 3 (11), 1 (01), 3 (11)
Compare Code (Green): 10
Each code is compared against the predicate one value at a time (see the sketch below), appending one result bit per row.
Result BitVector: 11010000 0000...
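For reference, a minimal sketch of this row-at-a-time scan (illustrative C-style code; the names and types are assumptions, not the paper's implementation):

#include <stdint.h>
#include <stddef.h>

/* Compare each 2-bit color code against the predicate code, one row at a
 * time, setting one result bit per matching row (bit 0 of the vector is
 * row 0). Assumes `bitvector` is zero-initialized. */
void scan_traditional(const uint8_t *codes, size_t n,
                      uint8_t pred, uint64_t *bitvector) {
    for (size_t i = 0; i < n; ++i)
        if (codes[i] == pred)
            bitvector[i / 64] |= 1ULL << (63 - i % 64);
}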
Vertical Layout
Color codes: c0=2 (10), c1=2 (10), c2=1 (01), c3=2 (10), c4=3 (11), c5=0 (00), c6=3 (11), c7=1 (01), c8=3 (11), c9=0 (00)

Each word holds one bit position from many codes (see the sketch below):

           c0 c1 c2 c3 c4 c5 c6 c7
w0 (high)   1  1  0  1  1  0  1  0
w1 (low)    0  0  1  0  1  0  1  1
           c8 c9
w2 (high)   1  0
w3 (low)    1  0

In memory: 11011010 00101011 10000000 ...
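A minimal sketch of how this bit-slicing could be built (illustrative code; the real BitWeaving layout has more structure than shown here):

#include <stdint.h>
#include <stddef.h>

/* Slice 2-bit codes into bit-parallel words: w_hi[k] holds the high bit of
 * codes 64k..64k+63, w_lo[k] the low bit (words assumed zero-initialized). */
void to_vertical(const uint8_t *codes, size_t n,
                 uint64_t *w_hi, uint64_t *w_lo) {
    for (size_t i = 0; i < n; ++i) {
        uint64_t bit = 1ULL << (63 - i % 64);
        if (codes[i] & 2) w_hi[i / 64] |= bit;  /* high bit of the code */
        if (codes[i] & 1) w_lo[i / 64] |= bit;  /* low bit of the code  */
    }
}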
CPU BitWeaving Scan
Column Data:      11011010 00101011 10000000
Compare Code:     11111111 00000000 (Green's bits, each broadcast across a whole word; see the sketch below)
Result BitVector: 11010000 0000...
CPU width: 64 bits, up to 256-bit SIMD
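A minimal sketch of the word-parallel comparison (assuming 2-bit codes and an equality predicate; BitWeaving itself supports more predicate types):

#include <stdint.h>

/* Broadcast a single predicate bit across a 64-bit word. */
static inline uint64_t broadcast(int bit) { return bit ? ~0ULL : 0ULL; }

/* Compare 64 two-bit codes against `pred` at once: a row matches only if
 * both of its bits equal the corresponding predicate bits. */
uint64_t scan_eq_64(uint64_t w_hi, uint64_t w_lo, unsigned pred) {
    uint64_t match = ~0ULL;                        /* start: all rows match  */
    match &= ~(w_hi ^ broadcast((pred >> 1) & 1)); /* filter on the high bit */
    match &= ~(w_lo ^ broadcast(pred & 1));        /* filter on the low bit  */
    return match;                                  /* one result bit per row */
}
/* For Green (pred = 2 = 10): 11011010 & ~00101011 = 11010000, as above. */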
GPU BitWeaving Scan
Column Data:      11011010 00101011 10000000
Compare Code:     11111111 11111111 11111111
Result BitVector: 11010000 0000...
GPU width: 16,384-bit SIMD
GPU Scan Algorithm
▪ GPU uses very wide "words" (see the sketch after this list)
  ▪ CPU: 64 bits, or 256 bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64 bits)
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
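An illustrative kernel for the wide scan (a sketch in CUDA for familiarity; the paper itself targets HSA, and all names here are assumptions):

#include <stdint.h>

/* Each thread applies the same bitwise comparison to its own 64-bit word,
 * so a group of 256 lanes behaves like one 16,384-bit bit-parallel word. */
__global__ void gpu_scan_eq(const uint64_t *w_hi, const uint64_t *w_lo,
                            uint64_t *out, unsigned pred, size_t n_words) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_words) return;
    uint64_t match = ~0ULL;
    match &= ~(w_hi[i] ^ (((pred >> 1) & 1) ? ~0ULL : 0ULL)); /* high bit */
    match &= ~(w_lo[i] ^ ((pred & 1) ? ~0ULL : 0ULL));        /* low bit  */
    out[i] = match;  /* 64 result bits per thread */
}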
CPU Aggregate Algorithm
Result BitVector: 11010000 0000...
Shirt Amount:     1, 3, 1, 5, 7, 2, 1, 4, 2
Result: 1 + 3 + 5 + ... (sum the amounts whose result bit is set)
GPU Aggregate Algorithm
Result BitVector: 11010000 0000...
→ Column Offsets: 0, 1, 3, ... (computed on CPU)
GPU Aggregate Algorithm
Column Offsets: 0, 1, 3, ...
Shirt Amount:   1, 3, 1, 5, 7, 2, 1, 4, 2
On GPU: Result = 1 + 3 + 5 + ...
Aggregate Algorithm
▪ Two phases (see the sketch after this list)
  ▪ Convert from BitVector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)
▪ Two group-by algorithms (see paper)
▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload subset of computation
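A minimal sketch of the two phases (assumptions: CUDA-style code rather than the paper's HSA runtime, and a simple atomic sum in place of the paper's aggregation kernels):

#include <stdint.h>
#include <stddef.h>

/* Phase 1 (CPU): expand the result BitVector into matching row offsets. */
size_t bitvector_to_offsets(const uint64_t *bv, size_t n_words,
                            uint32_t *offsets) {
    size_t k = 0;
    for (size_t w = 0; w < n_words; ++w)
        for (int b = 0; b < 64; ++b)
            if ((bv[w] >> (63 - b)) & 1)
                offsets[k++] = (uint32_t)(w * 64 + b);
    return k;  /* number of matching rows */
}

/* Phase 2 (GPU): materialize the matching amounts and aggregate them. */
__global__ void gather_sum(const int *amount, const uint32_t *offsets,
                           size_t n_offsets, unsigned long long *result) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_offsets)
        atomicAdd(result, (unsigned long long)amount[offsets[i]]);
}

Only the offsets and the referenced column values pass between the two phases; with HSA's cache coherence and shared address space, neither needs to be copied.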
Outline
▪ Background
▪ Algorithms
▪ Results
Experimental Methods
▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute-unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
  ▪ Separate discrete GPU
▪ Watts-Up meter for full-system power
▪ TPC-H @ scale factor 10
Scan Performance & Energy
[Chart: scan performance and energy]
Takeaway: Integrated GPU most efficient for scans
TPC-H Queries
[Charts: Query 12 performance and Query 12 energy]
Integrated GPU faster for both aggregate and scan computation.
More energy: the decrease in latency does not offset the power increase.
Less energy: decrease in latency AND decrease in power.
Future Die-Stacked GPUs
▪ 3D die stacking
▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth
[Diagram: a 3D stack of DRAM on a GPU on a CPU, mounted on the board]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record 44(1), March 2015.
Conclusions

                  Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance       High ☺          Moderate          High ☺
Memory Bandwidth  High ☺          Low ☹             High ☺
Overhead          High ☹          Low ☺             Low ☺
Memory Capacity   Low ☹           High ☺            Moderate
HSA vs CUDA/OpenCL
▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language
▪ CUDA/OpenCL are a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all features
Scan Performance & Energy
[Chart: additional scan performance and energy results]
Group-by Algorithms
[Figure: group-by algorithm details]
All TPC-H Results
[Charts: results for all TPC-H queries]
Average TPC-H Results
[Charts: average performance and average energy]
What's Next?
▪ Developing a cost model for the GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient
▪ Future "database machines"
  ▪ GPUs are a good tradeoff between specialization and commodity
Conclusions
▪ Integrated GPUs viable for DBMS?
  ▪ Solve the problems with discrete GPUs
  ▪ (Somewhat) better performance and energy
▪ Looking toward the future...
  ▪ CPUs cannot keep up with bandwidth
  ▪ GPUs perfectly designed for these workloads