Toward GPUs Being Mainstream in Analytic Processing
An initial argument using simple scan-aggregate queries

Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood
<powerjg@cs.wisc.edu>
DaMoN 2015 · June 1, 2015 · University of Wisconsin

Summary
▪ GPUs are energy efficient
▪ Discrete GPUs are unpopular for DBMSs
▪ New integrated GPUs solve the problems
▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload
▪ Up to 70% energy savings over a multicore CPU
  ▪ Even more in the future

Analytic Data is Growing
▪ Data is growing rapidly
▪ Analytic DBs increasingly important
▪ Want: high performance. Need: low energy.
Source: IDC's Digital Universe Study, 2012.

GPUs to the Rescue?
▪ GPUs are becoming more general
  ▪ Easier to program
▪ Integrated GPUs are everywhere
▪ GPUs show great promise [Govindaraju '04, He '14, He '14, Kaldewey '12, Satish '10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency
▪ Analytic DBs look like GPU workloads

GPU Microarchitecture
[Figure: a Graphics Processing Unit contains multiple Compute Units (CUs) sharing an L2 cache; each CU has an instruction fetch/scheduler, a set of SIMD lanes (SPs), a register file, an L1 cache, and a scratchpad.]

Discrete GPUs
[Figure: the CPU chip (cores plus its own memory bus) is connected to a discrete GPU and its separate memory over the PCIe bus; data moves back and forth across PCIe (steps ➊–➍), and the cycle repeats.]
▪ Copy data over PCIe ➊
  ▪ Low bandwidth
  ▪ High latency
▪ Small working memory ➋
▪ High-latency user→kernel calls ➌
▪ Repeated many times ➍
▪ Result: 98% of the time is spent not computing

Integrated GPUs
[Figure: a heterogeneous chip with CPU cores and GPU CUs sharing the same memory bus and memory.]

Heterogeneous System Architecture (HSA)
▪ API for tightly integrated accelerators
▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA Foundation (AMD, ARM, Qualcomm, others)
▪ No need for data copies
▪ Cache coherence and shared address space
▪ No OS kernel interaction
  ▪ User-mode queues
(These features address problems ➊–➍ above.)

Outline
▪ Background
▪ Algorithms
  ▪ Scan
  ▪ Aggregate
▪ Results

Analytic DBs
▪ Resident in main memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel '13 & '14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scans using sub-word parallelism
▪ Similar to industry proposals [SAP HANA, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries

Running Example

  Shirt Color   Shirt Amount
  Green  (2)    1
  Green  (2)    3
  Blue   (1)    1
  Green  (2)    5
  Yellow (3)    7
  Red    (0)    2
  Yellow (3)    1
  Blue   (1)    4
  Yellow (3)    2

  Color code dictionary: Red = 0, Blue = 1, Green = 2, Yellow = 3

Count the number of green shirts in the inventory:
➊ Scan the color column for green (2)
➋ Aggregate the amount column where there is a match

Traditional Scan Algorithm
[Figure: compare each code in the Shirt Color column (2, 2, 1, 2, 3, 0, 3, 1, 3, ...) against the compare code for Green (10), one value at a time, producing the result bit vector 11010000...]

Vertical Layout
[Figure: the 2-bit color codes c0–c9 are transposed into machine words: w0 holds the high bits of c0–c7 (11011010), w1 holds their low bits (00101011), and w2/w3 hold the high and low bits of c8–c9.]

CPU BitWeaving Scan
[Figure: the compare code for Green (10) is broadcast into one all-ones word for the high-bit plane and one all-zeros word for the low-bit plane; each column word is compared against the corresponding broadcast word, producing the result bit vector 11010000... one processor word at a time.]
CPU width: 64 bits, up to 256-bit SIMD.

GPU BitWeaving Scan
[Figure: the same bit-parallel comparison, but each GPU "word" spans the entire SIMD machine, so many column words are processed per step.]
GPU width: 16,384-bit SIMD.

GPU Scan Algorithm
▪ GPU uses very wide "words"
  ▪ CPU: 64 bits, or 256 bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64 bits)
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
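To make the bit-parallel scan concrete, here is a minimal CPU-side sketch of a BitWeaving-style equality scan over the vertical layout above. It is illustrative only, not the paper's code: the names (equalityScan, bitPlane) and the 64-bit bit-plane packing are assumptions. The GPU version applies the same per-word logic, with each SIMD lane handling one 64-bit segment so that a full wavefront consumes a 16,384-bit "word" per step.

    #include <cstdint>
    #include <vector>

    // Minimal sketch (not the paper's code): BitWeaving-style equality scan over a
    // vertically laid-out column of k-bit codes.
    // bitPlane[j][w] holds bit j (MSB first) of codes w*64 .. w*64+63, one code per bit lane.
    std::vector<uint64_t> equalityScan(const std::vector<std::vector<uint64_t>>& bitPlane,
                                       unsigned k, uint64_t queryCode) {
        const size_t numWords = bitPlane[0].size();
        std::vector<uint64_t> result(numWords, ~0ULL);          // start with every lane matching
        for (unsigned j = 0; j < k; ++j) {
            // Broadcast bit j of the query code across a whole word (all 1s or all 0s).
            const uint64_t queryPlane = ((queryCode >> (k - 1 - j)) & 1) ? ~0ULL : 0ULL;
            for (size_t w = 0; w < numWords; ++w)
                result[w] &= ~(bitPlane[j][w] ^ queryPlane);    // keep lanes whose bit j matches
        }
        return result;   // a lane is 1 iff its code equals queryCode
    }

For the running example, scanning for Green (10) keeps a lane only where the high-bit plane has a 1 and the low-bit plane has a 0, reproducing the result bit vector 11010000... from the slides.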
CPU Aggregate Algorithm
[Figure: walk the result bit vector (11010000...) and, for each set bit, add the corresponding Shirt Amount value: 1 + 3, then 1 + 3 + 5 + ...]

GPU Aggregate Algorithm
[Figure, phase 1 (on CPU): convert the result bit vector into a list of column offsets — 0, 1, then 0, 1, 3, ...]
[Figure, phase 2 (on GPU): use the column offsets (0, 1, 3, ...) to gather the matching Shirt Amount values and compute the result 1 + 3 + 5 + ...]

Aggregate Algorithm
▪ Two phases
  ▪ Convert from bit vector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)
▪ Two group-by algorithms (see paper; a hypothetical sketch of one variant appears at the end of this deck)
▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload a subset of the computation
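A minimal sketch of the two-phase aggregate described above, again illustrative rather than the paper's code (the function names and the MSB-first bit order are assumptions). Phase 1 runs on the CPU; phase 2 is the part the talk offloads to the integrated GPU, shown here as scalar code.

    #include <cstdint>
    #include <vector>

    // Phase 1 (CPU): turn the scan's result bit vector into a list of row offsets.
    // Bits are taken MSB-first within each word, matching the left-to-right order on the slides.
    std::vector<uint32_t> bitVectorToOffsets(const std::vector<uint64_t>& bits, size_t numRows) {
        std::vector<uint32_t> offsets;
        for (size_t i = 0; i < numRows; ++i)
            if ((bits[i / 64] >> (63 - (i % 64))) & 1)
                offsets.push_back(static_cast<uint32_t>(i));
        return offsets;
    }

    // Phase 2 (offloaded to the GPU in the talk): materialize the matching values and reduce.
    int64_t sumAtOffsets(const std::vector<int32_t>& amounts, const std::vector<uint32_t>& offsets) {
        int64_t sum = 0;
        for (uint32_t off : offsets)
            sum += amounts[off];          // on the GPU this becomes a parallel gather + reduction
        return sum;
    }

For the running example the offsets are {0, 1, 3}, selecting amounts 1, 3, and 5 (the 1 + 3 + 5 + ... on the slides). Because HSA provides a coherent shared address space, the offset list and the amount column are passed by pointer; no copies are needed and only a lightweight user-mode kernel dispatch crosses the CPU-GPU boundary.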
Outline
▪ Background
▪ Algorithms
▪ Results

Experimental Methods
▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute-unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
  ▪ Separate discrete GPU for comparison
▪ Watts Up meter for full-system power
▪ TPC-H at scale factor 10

Scan Performance & Energy
[Charts: scan performance and energy for CPU, discrete GPU, and integrated GPU.]
Takeaway: the integrated GPU is the most efficient for scans.

TPC-H Queries
[Charts: Query 12 performance and energy.]
▪ The integrated GPU is faster for both the aggregate and the scan computation
▪ More energy (discrete GPU): the decrease in latency does not offset the power increase
▪ Less energy (integrated GPU): a decrease in latency AND a decrease in power

Future Die-Stacked GPUs
▪ 3D die stacking
  ▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth
[Figure: DRAM die-stacked on top of a GPU, next to the CPU on the same board.]
Power et al. "Implications of 3D GPUs on the Scan Primitive." SIGMOD Record, Volume 44, Issue 1, March 2015.

Conclusions

                      Discrete GPUs   Integrated GPUs   3D-Stacked GPUs
  Performance         High ☺          Moderate          High ☺
  Memory Bandwidth    High ☺          Low ☹             High ☺
  Overhead            High ☹          Low ☺             Low ☺
  Memory Capacity     Low ☹           High ☺            Moderate

HSA vs CUDA/OpenCL
▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language
▪ CUDA/OpenCL sit a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all HSA features

Scan Performance & Energy
[Charts.]

Group-by Algorithms
[Charts.]

All TPC-H Results
[Charts.]

Average TPC-H Results
[Charts: average performance and average energy.]

What's Next?
▪ Developing a cost model for the GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient
▪ Future "database machines"
  ▪ GPUs are a good tradeoff between specialization and commodity

Conclusions
▪ Integrated GPUs viable for DBMSs?
  ▪ They solve the problems of discrete GPUs
  ▪ (Somewhat) better performance and energy
▪ Looking toward the future...
  ▪ CPUs cannot keep up with memory bandwidth
  ▪ GPUs are perfectly designed for these workloads
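Appendix: the aggregate slides mention two group-by algorithms but do not spell them out, so, as noted earlier, here is a sketch of one plausible group-by variant. It is hypothetical and not necessarily either of the paper's two algorithms; the names and the dense-accumulator approach are assumptions that fit the dictionary-encoded running example.

    #include <cstdint>
    #include <vector>

    // Hypothetical group-by aggregate over the matching offsets (not from the paper).
    // groupCodes is a dictionary-encoded column (e.g. shirt color codes 0..3), so a
    // dense accumulator array indexed by code is sufficient.
    std::vector<int64_t> groupBySum(const std::vector<uint32_t>& offsets,
                                    const std::vector<uint8_t>& groupCodes,
                                    const std::vector<int32_t>& amounts,
                                    size_t numGroups) {
        std::vector<int64_t> sums(numGroups, 0);
        for (uint32_t off : offsets)
            sums[groupCodes[off]] += amounts[off];   // on a GPU: atomics or per-workgroup partial sums
        return sums;
    }

On a GPU the usual choice is between contended atomic adds and per-workgroup partial accumulators in scratchpad memory that are merged afterward; which works better depends on the number of groups and the match rate.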